### Import Library


- import ast: Python's ast (Abstract Syntax Trees) library parses Python code. In text processing it is often used to convert the string representation of a Python data structure (such as a list or dictionary) back into the actual structure. It was used earlier to turn the string representation of the lists in the text_lemmatized column into real Python lists.

- from sklearn.feature_extraction.text import TfidfVectorizer: imports the TfidfVectorizer class from the feature_extraction.text module of scikit-learn (sklearn). TfidfVectorizer is widely used in natural language processing to compute TF-IDF (Term Frequency-Inverse Document Frequency) scores for a collection of text documents. It combines tokenization, term-frequency counting, and inverse-document-frequency weighting in a single, easy-to-use object.
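
As a minimal sketch of the ast conversion described above: the frame `demo` and its two rows are made up for illustration and only mimic how a list column looks after a CSV round trip. The function `ast.literal_eval` evaluates Python literals only, which makes it a safer choice than `eval()` for this kind of parsing.

```python
import ast

import pandas as pd

# Hypothetical two-row frame: each cell is a *string* that looks like a list.
demo = pd.DataFrame(
    {"text_lemmatized": ["['passwd', 'admin1234']", "['kek', 'udah', 'tenang']"]}
)

# Parse each string back into a real Python list.
demo["text_lemmatized"] = demo["text_lemmatized"].apply(ast.literal_eval)

first_cell = demo["text_lemmatized"].iloc[0]
print(type(first_cell).__name__, first_cell)
```

After the conversion every cell is a list, so list operations such as `' '.join(...)` work on the tokens rather than on the characters of a string.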

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import drive
import ast
from sklearn.feature_extraction.text import TfidfVectorizer

### Retrieve Data

Load the data from a CSV file on Google Drive into a pandas DataFrame and display its first five rows.

In [22]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [23]:
path = pd.read_csv('/content/drive/MyDrive/COLLEGE/SEMESTER_7/pemrosesan_teks/EDA/path_trigram_bigram.csv')
path.head()

Unnamed: 0,text_cleaned,text_tokens,text_stemmed,text_lemmatized,text_removal,text_bigrams,text_trigrams
0,nangkep bjorka ngakak guling kalo maling ayam ...,"['nangkep', 'bjorka', 'ngakak', 'guling', 'kal...","['nangkep', 'bjorka', 'ngakak', 'gule', 'kalo'...","['nangkep', 'bjorka', 'ngakak', 'gule', 'kalo'...","['nangkep', 'bjorka', 'ngakak', 'gule', 'kalo'...",nangkep_bjorka bjorka_ngakak ngakak_guling gul...,nangkep_bjorka_ngakak bjorka_ngakak_guling nga...
1,mas nya ini masa juga g bisa mendeteksi bjorka...,"['mas', 'nya', 'ini', 'masa', 'juga', 'g', 'bi...","['ma', 'nya', 'ini', 'masa', 'juga', 'g', 'bis...","['ma', 'nya', 'ini', 'masa', 'juga', 'g', 'bis...","['ma', 'nya', 'g', 'mendeteksi', 'bjorkabantu'...",mas_nya nya_ini ini_masa masa_juga juga_g g_bi...,mas_nya_ini nya_ini_masa ini_masa_juga masa_ju...
2,kek udah tenang gtu kan gak viral si bjorka eh...,"['kek', 'udah', 'tenang', 'gtu', 'kan', 'gak',...","['kek', 'udah', 'tenang', 'gtu', 'kan', 'gak',...","['kek', 'udah', 'tenang', 'gtu', 'kan', 'gak',...","['kek', 'udah', 'tenang', 'gtu', 'gak', 'viral...",kek_udah udah_tenang tenang_gtu gtu_kan kan_ga...,kek_udah_tenang udah_tenang_gtu tenang_gtu_kan...
3,kalo ngobrol shootnya dua2nya akan lebih dinik...,"['kalo', 'ngobrol', 'shootnya', 'dua2nya', 'ak...","['kalo', 'ngobrol', 'shootnya', 'dua2nya', 'ak...","['kalo', 'ngobrol', 'shootnya', 'dua2nya', 'ak...","['kalo', 'ngobrol', 'shootnya', 'dua2nya', 'di...",kalo_ngobrol ngobrol_shootnya shootnya_dua2nya...,kalo_ngobrol_shootnya ngobrol_shootnya_dua2nya...
4,passwd admin1234,"['passwd', 'admin1234']","['passwd', 'admin1234']","['passwd', 'admin1234']","['passwd', 'admin1234']",passwd_admin1234,


### TF-IDF

In [35]:
# Show the text_lemmatized column
print(path['text_lemmatized'].head())

# Check the dtype of the text_lemmatized column
print(path['text_lemmatized'].dtype)


0    [nangkep, bjorka, ngakak, gule, kalo, nangkep,...
1    [ma, nya, ini, masa, juga, g, bisa, mendeteksi...
2    [kek, udah, tenang, gtu, kan, gak, viral, si, ...
3    [kalo, ngobrol, shootnya, dua2nya, akan, lebih...
4                                  [passwd, admin1234]
Name: text_lemmatized, dtype: object
object
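
The dtype printed above is `object`, which on its own does not reveal whether each cell holds a real list or a string that merely looks like one. A small check on the element type settles it. The two Series below are made-up examples, and `dtype=object` is set explicitly so the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical columns: one holds strings, the other holds real lists.
as_strings = pd.Series(["['passwd', 'admin1234']"], dtype=object)
as_lists = pd.Series([["passwd", "admin1234"]], dtype=object)

# Both report the same dtype...
print(as_strings.dtype, as_lists.dtype)

# ...but the Python type of the elements differs.
string_types = as_strings.map(type).unique().tolist()
list_types = as_lists.map(type).unique().tolist()
print(string_types, list_types)
```

Running the same `.map(type)` check on `path['text_lemmatized']` would confirm whether the `ast` conversion has already been applied.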


In [43]:
# TfidfVectorizer handles tokenization, term-frequency (TF) counting, and inverse-document-frequency (IDF) weighting automatically when fit_transform() is called on the text data.
tfidf_vectorizer = TfidfVectorizer()

path['text_lemmatized_str'] = path['text_lemmatized'].apply(lambda x: ' '.join(x)): joins the elements of each list into a single space-separated string. This is needed because TfidfVectorizer, with its default settings, expects each document to be a string rather than a list of tokens.
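
One side effect of joining and re-tokenizing is worth noting: the default `token_pattern` of TfidfVectorizer only keeps tokens of two or more word characters, so single-character tokens such as `g` disappear from the vocabulary. The sketch below uses made-up token lists and compares the default behaviour with a callable `analyzer` that takes pre-tokenized lists as they are. The callable variant is an alternative shown for comparison, not what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up token lists standing in for rows of text_lemmatized.
tokens = [["mas", "nya", "g", "bisa"], ["passwd", "admin1234"]]

# Notebook approach: join to strings and let the vectorizer re-tokenize.
joined = [" ".join(doc) for doc in tokens]
default_vec = TfidfVectorizer()
default_vec.fit(joined)
print(sorted(default_vec.vocabulary_))

# Alternative: a callable analyzer receives each list and returns it unchanged.
pretok_vec = TfidfVectorizer(analyzer=lambda doc: doc)
pretok_vec.fit(tokens)
print(sorted(pretok_vec.vocabulary_))
```

Whether dropping one-letter tokens is desirable depends on the analysis; for informal Indonesian text, tokens like `g` (short for "nggak") may carry meaning.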


In [46]:
# Join each token list into one space-separated string per document.
path['text_lemmatized_str'] = path['text_lemmatized'].apply(lambda x: ' '.join(x))
# Learn the vocabulary and IDF weights from text_lemmatized_str, then return the TF-IDF matrix (one row per document).
tfidf_matrix = tfidf_vectorizer.fit_transform(path['text_lemmatized_str'])

In [47]:
# fit_transform() returns a sparse matrix because most TF-IDF scores are zero; .toarray() converts it
# into a dense NumPy array that stores every element, zeros included, for use in a DataFrame.
tfidf_dense = tfidf_matrix.toarray()
# Vocabulary terms, in the same order as the matrix columns.
feature_names = tfidf_vectorizer.get_feature_names_out()
# Each row of tfidf_df is a document, each column a term, and each value a TF-IDF score.
tfidf_df = pd.DataFrame(tfidf_dense, columns=feature_names)
display(tfidf_df)

Unnamed: 0,001,002,0030,003335,0057ℙℝ𝕆𝔹𝔼𝕋855emang,0101,012,0242,030,0601,...,𝒊𝒏𝒊,𝒑𝒆𝒏𝒈𝒂𝒍𝒊𝒉𝒂𝒏,𝕋𝕂ℙ𝟛𝟘𝟛,𝗛𝗢𝗞𝗜𝗕𝗢𝗦𝟴𝟵𝟴,𝙂𝙖𝙠,𝙙𝙞𝙜𝙖𝙟𝙞,𝙟𝙖𝙙𝙞,𝙠𝙚𝙧𝙟𝙖,𝙥𝙤𝙡𝙞𝙨𝙞,𝙩𝙖𝙥𝙞
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
# tfidf_results is defined in a cell not shown here; the loop treats it as a dict mapping each column name to its sparse TF-IDF matrix.
for col, tfidf_matrix in tfidf_results.items():
    # Convert the sparse TF-IDF matrix to a dense NumPy array for pandas.
    tfidf_dense = tfidf_matrix.toarray()
    # Build a DataFrame from the dense array (columns are integer positions, not terms).
    tfidf_df = pd.DataFrame(tfidf_dense)
    print(f"TF-IDF DataFrame pada kolom: {col}")
    display(tfidf_df.head())

TF-IDF DataFrame pada kolom: text_tokens


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6323,6324,6325,6326,6327,6328,6329,6330,6331,6332
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


TF-IDF DataFrame pada kolom: text_stemmed


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6263,6264,6265,6266,6267,6268,6269,6270,6271,6272
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


TF-IDF DataFrame pada kolom: text_lemmatized


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6250,6251,6252,6253,6254,6255,6256,6257,6258,6259
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


TF-IDF DataFrame pada kolom: text_removal


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5875,5876,5877,5878,5879,5880,5881,5882,5883,5884
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
