# Preprocessing

## Install & Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This is the opening part of the code. It imports all the libraries, modules, and dependencies used in the text analysis, such as NLTK, Sastrawi, Scikit-Learn, and Pandas.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


import warnings
import pandas as pd
import numpy as np
import re
import nltk
import csv

nltk.download('stopwords')
warnings.filterwarnings("ignore")
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

***1. Load Dataset***

This step reads the data from a CSV file using Pandas. The data is loaded into a DataFrame named df, which serves as the basis for the text analysis that follows.

In [None]:
df = pd.read_csv("/content/drive/MyDrive/PPW/crawling_pta_labeled - crawling_pta.csv")
df

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Label
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,RPL
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,RPL
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,RPL
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,KK
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,KK
...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...,KK
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...,KK
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...,KK
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...,KK


## **1. Cleaning Data**


**Removing Null Data**

This code checks for and handles missing values (NaN) in the DataFrame df. `df.isnull().sum()` counts the missing values per column, and `df.dropna()` removes every row that has at least one, which leaves 828 rows.

In [None]:
df.isnull().sum()

Judul                   6
Penulis                10
Dosen Pembimbing I     10
Dosen Pembimbing II    11
Abstrak                29
Label                   0
dtype: int64

In [None]:
df = df.dropna()
df.isnull().sum()

Judul                  0
Penulis                0
Dosen Pembimbing I     0
Dosen Pembimbing II    0
Abstrak                0
Label                  0
dtype: int64
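
`dropna()` with no arguments drops a row if any column is missing, so a row is also discarded when its only gap is, for example, a supervisor name. As a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the dataset), the `subset` parameter restricts the check to the column the later steps actually read:

```python
import pandas as pd

# Toy frame (hypothetical values) with the same kind of gaps as the real data
toy = pd.DataFrame({
    "Judul": ["A", "B", "C"],
    "Dosen Pembimbing II": ["X", None, "Z"],
    "Abstrak": ["teks satu", "teks dua", None],
})

# Default: drop a row if ANY column is missing
strict = toy.dropna()

# Alternative: only require the column used for text analysis
lenient = toy.dropna(subset=["Abstrak"])

print(len(strict), len(lenient))
```

Whether the lenient variant is appropriate depends on whether the other columns are needed downstream; the notebook itself uses the strict form.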

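**Removing Specific Characters**

The `cleaning` function cleans the text in the 'Abstrak' column. It removes every character that is not an ASCII letter or whitespace (punctuation, digits, and other symbols) and strips leading and trailing whitespace. The result is still a single string; splitting into words happens later, in the tokenizing step.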

In [None]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Abstrak'].apply(cleaning)
df['Cleaning']

0      Sistem  informasi  akademik  SIAKAD merupakan ...
1      Berjalannya koneksi jaringan komputer dengan l...
2      Web server adalah sebuah perangkat lunak serve...
3      Penjadwalan  kuliah  di  Perguruan  Tinggi  me...
4      Seiring perkembangan teknologi yang ada diduni...
                             ...                        
853    Investasi saham selama ini memiliki resiko ker...
854    Information Retrieval IR merupakan pengambilan...
855    Klasifikasi citra merupakan proses pengelompok...
856    Identifikasi atribut pejalan kaki merupakan sa...
857    Topik deteksi objek telah menarik perhatian ya...
Name: Cleaning, Length: 828, dtype: object
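
To make the effect of the pattern concrete, here is a small standalone sketch on a made-up sentence. `cleaning` repeats the function above; `cleaning_collapsed` is a hypothetical variant, not part of the notebook, that additionally collapses the repeated spaces still visible in the output above.

```python
import re

def cleaning(text):
    # Same pattern as the notebook: keep only ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()

def cleaning_collapsed(text):
    # Hypothetical variant: also collapse runs of whitespace into one space
    return re.sub(r'\s+', ' ', cleaning(text))

sample = "Sistem  informasi (SIAKAD) v2.0, 100%!"
print(cleaning(sample))
print(cleaning_collapsed(sample))
```

Digits and punctuation disappear entirely, so `v2.0` is reduced to `v`. Extra spaces do not affect the tokenizer used later, so collapsing them is only a cosmetic choice.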

In [None]:
df[['Judul','Penulis','Dosen Pembimbing I', 'Dosen Pembimbing II', 'Abstrak', 'Cleaning']]

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Cleaning
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,Sistem informasi akademik SIAKAD merupakan ...
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,Berjalannya koneksi jaringan komputer dengan l...
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,Web server adalah sebuah perangkat lunak serve...
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,Penjadwalan kuliah di Perguruan Tinggi me...
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,Seiring perkembangan teknologi yang ada diduni...
...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...,Investasi saham selama ini memiliki resiko ker...
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...,Information Retrieval IR merupakan pengambilan...
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...,Klasifikasi citra merupakan proses pengelompok...
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...,Identifikasi atribut pejalan kaki merupakan sa...


The `cek_specialCharacter` function checks the cleaned text for special characters. If one is found, the text is printed. Nothing is printed below, so no special characters remain; the `None` values are simply the function's return value.

In [None]:
def cek_specialCharacter(dokumen):
  # ASCII punctuation that should no longer appear after cleaning
  karakter = ['!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-', '_', '+', '=', '{', '}', '[', ']', '|', '\\', ':', ';', '"', "'", '<', '>', ',', '.', '?', '/', '`', '~']
  for i in dokumen:
    if i in karakter:
      # Print the offending document once, then stop scanning it
      print(dokumen)
      return

df['Cleaning'].apply(cek_specialCharacter)

0      None
1      None
2      None
3      None
4      None
       ... 
853    None
854    None
855    None
856    None
857    None
Name: Cleaning, Length: 828, dtype: object
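
A check that returns a boolean instead of printing is easier to count or filter on. The sketch below is one possible form; `has_special_character` is a hypothetical helper name, and it relies on `string.punctuation` from the standard library, which covers the same ASCII punctuation characters as the list above.

```python
import string

def has_special_character(text):
    # True if at least one ASCII punctuation character is present
    return any(ch in string.punctuation for ch in text)

print(has_special_character("web server"))
print(has_special_character("client-server"))
```

Applied to the notebook's data, `df['Cleaning'].apply(has_special_character).sum()` would give the number of documents that still contain such a character.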

## **2. Tokenizing**

The `tokenizer` function performs case folding (converting the text to lowercase) and tokenization on the cleaned text, splitting it into word tokens. This step uses the NLTK library.

In [None]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0      [sistem, informasi, akademik, siakad, merupaka...
1      [berjalannya, koneksi, jaringan, komputer, den...
2      [web, server, adalah, sebuah, perangkat, lunak...
3      [penjadwalan, kuliah, di, perguruan, tinggi, m...
4      [seiring, perkembangan, teknologi, yang, ada, ...
                             ...                        
853    [investasi, saham, selama, ini, memiliki, resi...
854    [information, retrieval, ir, merupakan, pengam...
855    [klasifikasi, citra, merupakan, proses, pengel...
856    [identifikasi, atribut, pejalan, kaki, merupak...
857    [topik, deteksi, objek, telah, menarik, perhat...
Name: Tokenizing, Length: 828, dtype: object
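
The two operations can be illustrated without the NLTK data files. The sketch below uses plain whitespace splitting as a stand-in for `word_tokenize`; `simple_tokenizer` is a hypothetical name. Because the cleaned text contains only letters and whitespace, the two approaches should agree on ordinary words, although NLTK may split a few English forms such as `cannot` differently.

```python
def simple_tokenizer(text):
    # Case folding followed by whitespace tokenization
    return text.lower().split()

tokens = simple_tokenizer("Sistem  informasi  akademik  SIAKAD merupakan")
print(tokens)
```

Repeated spaces are harmless here, since `split()` with no argument treats any run of whitespace as one separator.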

In [None]:
df[['Judul','Penulis','Dosen Pembimbing I', 'Dosen Pembimbing II', 'Abstrak', 'Cleaning', 'Tokenizing']]

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Cleaning,Tokenizing
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,Sistem informasi akademik SIAKAD merupakan ...,"[sistem, informasi, akademik, siakad, merupaka..."
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,Berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, den..."
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,Web server adalah sebuah perangkat lunak serve...,"[web, server, adalah, sebuah, perangkat, lunak..."
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,Penjadwalan kuliah di Perguruan Tinggi me...,"[penjadwalan, kuliah, di, perguruan, tinggi, m..."
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,Seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, yang, ada, ..."
...,...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...,Investasi saham selama ini memiliki resiko ker...,"[investasi, saham, selama, ini, memiliki, resi..."
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...,Information Retrieval IR merupakan pengambilan...,"[information, retrieval, ir, merupakan, pengam..."
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...,Klasifikasi citra merupakan proses pengelompok...,"[klasifikasi, citra, merupakan, proses, pengel..."
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...,Identifikasi atribut pejalan kaki merupakan sa...,"[identifikasi, atribut, pejalan, kaki, merupak..."


Count the number of words (tokens) in each abstract.

In [None]:
def count_word(dokumens):
  return len(dokumens)

df['Count Word'] = df['Tokenizing'].apply(count_word)
df

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Label,Cleaning,Tokenizing,Count Word
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,RPL,Sistem informasi akademik SIAKAD merupakan ...,"[sistem, informasi, akademik, siakad, merupaka...",150
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,RPL,Berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, den...",204
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,RPL,Web server adalah sebuah perangkat lunak serve...,"[web, server, adalah, sebuah, perangkat, lunak...",182
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,KK,Penjadwalan kuliah di Perguruan Tinggi me...,"[penjadwalan, kuliah, di, perguruan, tinggi, m...",134
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,KK,Seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, yang, ada, ...",137
...,...,...,...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...,KK,Investasi saham selama ini memiliki resiko ker...,"[investasi, saham, selama, ini, memiliki, resi...",173
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...,KK,Information Retrieval IR merupakan pengambilan...,"[information, retrieval, ir, merupakan, pengam...",134
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...,KK,Klasifikasi citra merupakan proses pengelompok...,"[klasifikasi, citra, merupakan, proses, pengel...",259
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...,KK,Identifikasi atribut pejalan kaki merupakan sa...,"[identifikasi, atribut, pejalan, kaki, merupak...",211
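
The same count can be obtained without a helper function. As a small sketch on made-up token lists, the pandas `.str.len()` accessor works element-wise on lists as well as on strings:

```python
import pandas as pd

# Toy token lists standing in for the 'Tokenizing' column
tokens = pd.Series([["web", "server", "lunak"], ["sistem", "informasi"]])

# .str.len() returns the length of each list
word_counts = tokens.str.len()
print(word_counts.tolist())
```

In the notebook this would read `df['Tokenizing'].str.len()`; the result should match `count_word`.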


## **3. Stopword Removal**

Stopwords are common words that usually carry little value in text analysis. The `stopwordText` function removes stopwords from the word tokens produced in the previous step.

The remaining tokens are then joined back into a single string and stored in the 'Full Text' column. This stage uses the Indonesian stopword list from the NLTK library.

In [None]:
# A set gives fast membership tests compared with scanning a list
corpus = set(stopwords.words('indonesian'))

def stopwordText(words):
  # Keep only the tokens that are not stopwords
  return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0      sistem informasi akademik siakad sistem inform...
1      berjalannya koneksi jaringan komputer lancar g...
2      web server perangkat lunak server berfungsi me...
3      penjadwalan kuliah perguruan kompleks permasal...
4      seiring perkembangan teknologi didunia muncul ...
                             ...                        
853    investasi saham memiliki resiko kerugian dikar...
854    information retrieval ir pengambilan informasi...
855    klasifikasi citra proses pengelompokan piksel ...
856    identifikasi atribut pejalan kaki salah peneli...
857    topik deteksi objek menarik perhatian perkemba...
Name: Full Text, Length: 828, dtype: object
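
The filtering and re-joining steps can be traced on a single token list. This is a minimal sketch with a tiny, hypothetical stopword set; the notebook uses NLTK's full Indonesian list, and `remove_stopwords` is an illustrative name.

```python
# Tiny hypothetical stopword set; the notebook uses NLTK's Indonesian list
toy_stopwords = {"adalah", "sebuah", "yang", "di"}

def remove_stopwords(tokens, stopword_set):
    # Keep tokens that are not in the stopword set
    return [t for t in tokens if t not in stopword_set]

tokens = ["web", "server", "adalah", "sebuah", "perangkat", "lunak"]
filtered = remove_stopwords(tokens, toy_stopwords)

# Join the surviving tokens back into one string, as in 'Full Text'
full_text = " ".join(filtered)
print(full_text)
```

Passing the stopwords as a set keeps each lookup cheap, which matters when every token of every abstract is checked.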

In [None]:
df[['Judul','Penulis','Dosen Pembimbing I', 'Dosen Pembimbing II', 'Abstrak', 'Cleaning', 'Tokenizing', 'Stopword Removal']]

Unnamed: 0,Judul,Penulis,Dosen Pembimbing I,Dosen Pembimbing II,Abstrak,Cleaning,Tokenizing,Stopword Removal
0,PERANCANGAN DAN IMPLEMENTASI SISTEM DATABASE ...,A.Ubaidillah S.Kom,Budi Setyono M.T,Hermawan S.T,Sistem informasi akademik (SIAKAD) merupaka...,Sistem informasi akademik SIAKAD merupakan ...,"[sistem, informasi, akademik, siakad, merupaka...","[sistem, informasi, akademik, siakad, sistem, ..."
1,APLIKASI KONTROL DAN MONITORING JARINGAN KOMPU...,"M. Basith Ardianto,","Drs. Budi Soesilo, MT","Koko Joni, ST",Berjalannya koneksi jaringan komputer dengan l...,Berjalannya koneksi jaringan komputer dengan l...,"[berjalannya, koneksi, jaringan, komputer, den...","[berjalannya, koneksi, jaringan, komputer, lan..."
2,RANCANG BANGUN APLIKASI PROXY SERVER UNTUK ENK...,"Akhmad Suyandi, S.Kom","Drs. Budi Soesilo, M.T","Hermawan, ST, MT",Web server adalah sebuah perangkat lunak serve...,Web server adalah sebuah perangkat lunak serve...,"[web, server, adalah, sebuah, perangkat, lunak...","[web, server, perangkat, lunak, server, berfun..."
3,SISTEM PENDUKUNG KEPUTUSAN OPTIMASI PENJADWALA...,Heri Supriyanto,"Mulaab, S.Si., M.Kom","Firli Irhamni, ST., M.Kom",Penjadwalan kuliah di Perguruan Tinggi me...,Penjadwalan kuliah di Perguruan Tinggi me...,"[penjadwalan, kuliah, di, perguruan, tinggi, m...","[penjadwalan, kuliah, perguruan, kompleks, per..."
4,SISTEM AUGMENTED REALITY ANIMASI BENDA BERGERA...,Septian Rahman Hakim,"Arik Kurniawati, S.Kom., M.T.","Haryanto, S.T., M.T.",Seiring perkembangan teknologi yang ada diduni...,Seiring perkembangan teknologi yang ada diduni...,"[seiring, perkembangan, teknologi, yang, ada, ...","[seiring, perkembangan, teknologi, didunia, mu..."
...,...,...,...,...,...,...,...,...
853,PENERAPAN ALGORITMA LONG-SHORT TERM MEMORY UNT...,Rachmad Agung Pambudi,"Eka Mala Sari Rochman, S.Kom., M.Kom","Sri Herawati, S.Kom., M.Kom",Investasi saham selama ini memiliki resiko ker...,Investasi saham selama ini memiliki resiko ker...,"[investasi, saham, selama, ini, memiliki, resi...","[investasi, saham, memiliki, resiko, kerugian,..."
854,SISTEM PENCARIAN TEKS AL-QURAN TERJEMAHAN BERB...,Nadila Hidayanti,"Achmad Jauhari, S.T., M.Kom","Ika Oktavia Suzanti, S.Kom., M.Cs",Information Retrieval (IR) merupakan pengambil...,Information Retrieval IR merupakan pengambilan...,"[information, retrieval, ir, merupakan, pengam...","[information, retrieval, ir, pengambilan, info..."
855,KLASIFIKASI KOMPLEKSITAS VISUAL CITRA SAMPAH M...,Afni Sakinah,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Moch. Kautsar Sophan, S.Kom., M.MT.",Klasifikasi citra merupakan proses pengelompok...,Klasifikasi citra merupakan proses pengelompok...,"[klasifikasi, citra, merupakan, proses, pengel...","[klasifikasi, citra, proses, pengelompokan, pi..."
856,IDENTIFIKASI BINER ATRIBUT PEJALAN KAKI MENGGU...,Friska Fatmawatiningrum,"Dr. Indah Agustien Siradjuddin, S.Kom., M.Kom.","Prof. Dr. Arief Muntasa, S.Si., M.MT.",Identifikasi atribut pejalan kaki merupakan sa...,Identifikasi atribut pejalan kaki merupakan sa...,"[identifikasi, atribut, pejalan, kaki, merupak...","[identifikasi, atribut, pejalan, kaki, salah, ..."


## **4. Stemming**

Stemming reduces affixed words to their root form. This stage uses the Sastrawi library. Swifter can speed up the `apply` call, but it is not imported in this notebook, so the code below uses a plain `apply`.



```
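# Build the Sastrawi stemmer once; creating it for every row is slow
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stemmingText(dokumens):
  # Reduce each token to its root form
  return [stemmer.stem(i) for i in dokumens]

df['Stemming'] = df['Stopword Removal'].apply(stemmingText)
df['Stemming']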
```
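
Stemming every token of every abstract repeats a lot of work, because the same words recur across documents. One possible mitigation, sketched here under the assumption that the stemmer returns the same root for the same token every time, is to cache results per distinct token. `make_cached_stemmer` is a hypothetical helper and `toy_stem` is only a placeholder, not a real Indonesian stemmer; in the notebook, `stemmer.stem` would be passed in its place.

```python
from functools import lru_cache

def make_cached_stemmer(stem_fn, maxsize=None):
    # Wrap any token -> root function so each distinct token is stemmed once
    return lru_cache(maxsize=maxsize)(stem_fn)

calls = []

def toy_stem(token):
    # Placeholder only: strips a trailing "nya"; NOT a real Indonesian stemmer
    calls.append(token)
    return token[:-3] if token.endswith("nya") else token

cached_stem = make_cached_stemmer(toy_stem)
stems = [cached_stem(t) for t in ["kurangnya", "sistem", "kurangnya", "sistem"]]
print(stems, len(calls))
```

The underlying function is called only for the two distinct tokens, while the four inputs all receive a result.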



# VSM (Vector Space Model)

### 1. One Hot Encoding

**One Hot Encoder Function Using Pandas**

The `pandasOneHotEncoder` function encodes the stopword-filtered tokens as a DataFrame with one column per distinct word. Because the dummy columns are summed per document, each cell holds the number of times the word occurs in that document rather than a strict 0/1 presence flag (row 855 below has 2.0 in column `a`).

In [None]:
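def pandasOneHotEncoder(dokumens):
  # One row per token, one dummy column per distinct word, then sum per document.
  # groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
  encoder = pd.get_dummies(dokumens.apply(pd.Series).stack()).groupby(level=0).sum()
  result = pd.concat([dokumens, encoder], axis=1)

  return result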

oneHotEncoder = pandasOneHotEncoder(df['Stopword Removal'])
oneHotEncoder

Unnamed: 0,Stopword Removal,a,aalysis,aam,ab,abad,abadi,ability,abjad,absensi,...,zara,zat,zcz,zf,zona,zone,zoning,zoom,zucara,zungu
0,"[sistem, informasi, akademik, siakad, sistem, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"[berjalannya, koneksi, jaringan, komputer, lan...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"[web, server, perangkat, lunak, server, berfun...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"[penjadwalan, kuliah, perguruan, kompleks, per...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"[seiring, perkembangan, teknologi, didunia, mu...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,"[investasi, saham, memiliki, resiko, kerugian,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
854,"[information, retrieval, ir, pengambilan, info...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
855,"[klasifikasi, citra, proses, pengelompokan, pi...",2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
856,"[identifikasi, atribut, pejalan, kaki, salah, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
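
The difference between per-document counts and a strict presence/absence encoding comes down to the aggregation applied to the dummy columns. The sketch below uses two made-up token lists and `Series.explode` (an alternative to `apply(pd.Series).stack()`, not what the notebook uses) to show both variants side by side.

```python
import pandas as pd

# Toy token lists standing in for the 'Stopword Removal' column
docs = pd.Series([["web", "server", "web"], ["server", "lunak"]])

# One row per token, one 0/1 column per distinct word
dummies = pd.get_dummies(docs.explode(), dtype=int)

# Summing per document gives occurrence counts
counts = dummies.groupby(level=0).sum()

# Taking the maximum per document gives a strict 0/1 encoding
binary = dummies.groupby(level=0).max()

print(counts)
print(binary)
```

With `sum`, the repeated word `web` gets a 2 in the first document; with `max`, it is capped at 1.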


**Save Into CSV**

In [None]:
oneHotEncoder.to_csv('OneHotEncoder.csv', index=False)
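
One thing to keep in mind when saving this table: the 'Stopword Removal' column holds Python lists, and a CSV file can only store them as text. The sketch below uses the standard `csv` module on an in-memory buffer to show the round trip; pandas' `to_csv` writes list cells in the same string form, so a similar conversion with `ast.literal_eval` would be needed after `read_csv`.

```python
import ast
import csv
import io

# Write a row whose first cell is a Python list, as in 'Stopword Removal'
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Stopword Removal", "web"])
writer.writerow([["web", "server"], 1])

# Read it back: the list comes back as its string representation
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]

# literal_eval safely turns the string back into a list
tokens = ast.literal_eval(raw_cell)
print(type(raw_cell).__name__, tokens)
```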

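### 2. TF-IDF

**TF-IDF Function**

The `tfidf` function applies TF-IDF vectorization to the cleaned, stopword-filtered text in 'Full Text'. The result is a numeric representation of each document under the TF-IDF scheme. `TfidfVectorizer`'s default token pattern keeps only tokens of two or more characters, which is why the single-letter column `a` from the one-hot table is absent here.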

In [None]:
def tfidf(dokumen):
  # Fit the vectorizer and get a dense document-term matrix
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  # Pass raw values: a Series would be aligned on df's gapped index, not by position
  final_tfidf.insert(0, 'Dokumen', dokumen.to_numpy())

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'])
final_tfidf

Unnamed: 0,Dokumen,aalysis,aam,ab,abad,abadi,ability,abjad,absensi,absolut,...,zara,zat,zcz,zf,zona,zone,zoning,zoom,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,berjalannya koneksi jaringan komputer lancar g...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,web server perangkat lunak server berfungsi me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,penjadwalan kuliah perguruan kompleks permasal...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,seiring perkembangan teknologi didunia muncul ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
823,kurangnya pemahaman gejala penyakit saluran pe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
824,data set hilang utama studi bersifat substansi...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
825,proses seleksi penerimaan tenaga kerja faktor ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
826,sapi salah hewan ternak komoditi utama bahan p...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
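
To show where the weights come from, here is a hand-rolled sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is an illustration on two made-up documents, not the library's implementation; `tfidf_rows` is a hypothetical name, and tokens are split on whitespace rather than with the vectorizer's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(["web server web", "server lunak"])
print(vocab)
print(rows)
```

A term that occurs in every document (`server` here) gets the minimum idf of 1, so rarer terms such as `web` and `lunak` weigh more in their rows.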


### Save into CSV

In [None]:
final_tfidf.to_csv('TF IDF.csv', index=False)

## 3. Term Frequency

### Term Frequency Function

The `term_freq` function applies Term Frequency (TF) vectorization to the cleaned, stopword-filtered text in 'Full Text'. The result is a numeric representation in which each cell is the raw number of times a term occurs in a document.

In [None]:
def term_freq(dokumens):
  # Create the CountVectorizer and build a dense document-term count matrix
  vectorizer = CountVectorizer()
  tf_matrix = vectorizer.fit_transform(dokumens).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tf = pd.DataFrame(tf_matrix, columns=terms)
  # Pass raw values: a Series would be aligned on df's gapped index, not by position
  final_tf.insert(0, 'Dokumen', dokumens.to_numpy())

  return (vectorizer, final_tf, tf_matrix, terms)

tf_vectorizer, final_tf, tf_matrix, tf_terms = term_freq(df['Full Text'])
final_tf

Unnamed: 0,Dokumen,aalysis,aam,ab,abad,abadi,ability,abjad,absensi,absolut,...,zara,zat,zcz,zf,zona,zone,zoning,zoom,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,berjalannya koneksi jaringan komputer lancar g...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,web server perangkat lunak server berfungsi me...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,penjadwalan kuliah perguruan kompleks permasal...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,seiring perkembangan teknologi didunia muncul ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
823,kurangnya pemahaman gejala penyakit saluran pe...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
824,data set hilang utama studi bersifat substansi...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
825,proses seleksi penerimaan tenaga kerja faktor ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
826,sapi salah hewan ternak komoditi utama bahan p...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
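
When a Series is passed to `DataFrame.insert`, pandas aligns it on index labels rather than on position. Because `df` keeps its original labels after `dropna()` (the outputs above run to label 857 for 828 rows), the 'Dokumen' column and the matrix rows can drift apart unless the values are passed positionally. A small sketch with made-up data shows the difference:

```python
import pandas as pd

# A Series whose index has a gap, like df['Full Text'] after dropna()
docs = pd.Series(["doc a", "doc c"], index=[0, 2])

# A frame built from a matrix gets a fresh 0..n-1 index
matrix = pd.DataFrame({"web": [1, 0]})

by_label = matrix.copy()
by_label.insert(0, "Dokumen", docs)  # aligned on index labels

by_position = matrix.copy()
by_position.insert(0, "Dokumen", docs.to_numpy())  # aligned on position

print(by_label["Dokumen"].tolist())
print(by_position["Dokumen"].tolist())
```

In the label-aligned version the second row has no matching label and is left missing, while the positional version keeps both documents next to their rows.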


### Save into CSV

In [None]:
final_tf.to_csv('Term Frequensi.csv', index=False)


## 4. Logarithm Frequency

The `logarithm_freq` function applies a logarithmic transformation, log10(tf + 1), to the Term Frequency data. This dampens the influence of terms that are repeated many times within a document, so that very frequent words dominate the representation less.

### Logarithm Frequency Function

In [None]:
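def logarithm_freq(dokumens):
  # Adding 1 keeps zero counts at zero, since log10(1) = 0
  return np.log10(dokumens + 1)

df_logarithm_freq = pd.DataFrame(tf_matrix, columns=tf_terms).apply(logarithm_freq)
# Pass raw values so the documents line up with the matrix rows by position
df_logarithm_freq.insert(0, 'Dokumen', df['Full Text'].to_numpy())
df_logarithm_freq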

Unnamed: 0,Dokumen,aalysis,aam,ab,abad,abadi,ability,abjad,absensi,absolut,...,zara,zat,zcz,zf,zona,zone,zoning,zoom,zucara,zungu
0,sistem informasi akademik siakad sistem inform...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,berjalannya koneksi jaringan komputer lancar g...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,web server perangkat lunak server berfungsi me...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,penjadwalan kuliah perguruan kompleks permasal...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,seiring perkembangan teknologi didunia muncul ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
823,kurangnya pemahaman gejala penyakit saluran pe...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
824,data set hilang utama studi bersifat substansi...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
825,proses seleksi penerimaan tenaga kerja faktor ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
826,sapi salah hewan ternak komoditi utama bahan p...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
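
The tables above show only zeros in the visible columns, so the effect of the transformation is easier to see on a few raw counts. This sketch applies the same formula to single numbers with the standard `math` module; `log_freq` is an illustrative name.

```python
import math

def log_freq(count):
    # Same transform as logarithm_freq, for a single raw count
    return math.log10(count + 1)

for count in [0, 1, 9, 99]:
    print(count, round(log_freq(count), 4))
```

A count that grows from 9 to 99 only raises the weight from 1 to 2, which is the dampening effect described above.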


### Save into CSV

In [None]:
df_logarithm_freq.to_csv('Logarithm Frequensi.csv', index=False)