#**Opening**

Assalamu'alaikum warahmatullahi wabarakatuh. Praise be to Allah Subhanahu wa Ta'ala for the mercy and guidance bestowed upon us all. May peace and blessings always be upon the Prophet Muhammad, the Messenger of Allah, sallallahu 'alaihi wasallam.

Hello, **Data Warriors**. Welcome to the seventh session of the **Machine Learning Algorithms** Training Program, Advanced Class.

In this session you will learn about:

*   What NLP is
*   Data scraping
*   Text preprocessing
    *   Case folding & data cleaning
    *   Lemmatization
    *   Stemming
    *   Slang words
    *   Stop words
    *   Unwanted words

#**Natural Language Processing**

Natural Language Processing (NLP) is a branch of AI that focuses on processing natural language. Natural language is the language people generally use to communicate with one another. Language received by a computer must first be processed and interpreted so that the computer can correctly understand what the user means.

NLP has many applications. Among them are chatbots (applications that let users seemingly converse with a computer), stemming or lemmatization (reducing words in a given language to their base form so that the function of each word in a sentence can be recognized), summarization (producing a summary of a text), translation tools (translating between languages), and other applications that enable a computer to understand language instructions entered by the user.

Pustejovsky and Stubbs (2012) describe several major research areas in the field of NLP, including:

1. **Question Answering Systems (QAS)**. The ability of a computer to answer questions posed by the user. Instead of typing keywords into a search engine, with QAS the user can ask directly in their own natural language, whether English, Mandarin, or Indonesian.

2. **Summarization**. Producing a summary of a collection of documents or emails. With such an application, users can be helped to convert large text documents into presentation slides.

3. **Machine Translation**. The resulting products are applications that can understand human language and translate it into another language. This includes Google Translate, whose translations, on close inspection, keep improving. Another example is BabelFish, which translates in real time.

4. **Speech Recognition**. This is one of the more difficult areas of NLP. Considerable work has already gone into building models that phones and computers use to recognize spoken language. The spoken input is most often in the form of questions and commands.

5. **Document classification**. This is one of the most successful areas of NLP research. The task is to determine the best place for a newly entered document within the system. This is very useful in applications such as spam filtering, news article classification, and movie reviews.

#**Scraping Text Data**

Before applying NLP or conducting research, we need to collect text data, which is the raw material of this field. This process is commonly called data scraping. Data can be scraped from various platforms: directly from a particular web page, through an API such as Twitter's, or through ready-made tools, which may be free or paid. To start learning NLP, we will use such a tool. The `google_play_scraper` library can be used to retrieve reviews of apps on Google Play. First, install it as follows.

**Installing google_play_scraper**

In [75]:
!pip install google_play_scraper



**Import Library**

In [2]:
import numpy as np
import pandas as pd
from google_play_scraper import Sort, reviews  # Library for scraping data
import re  # Library for text processing
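The notebook imports `Sort` and `reviews` but does not show them in use, so here is a minimal sketch of how a scraping call might look. The app id `com.example.app`, the `lang`/`country` values, and the helper names `fetch_reviews` and `keep_fields` are assumptions made for illustration. The selected field names follow the review dictionaries the library is understood to return, and should be checked against the installed version.

```python
def fetch_reviews(app_id, count=100):
    """Fetch the newest reviews for one app (requires network access)."""
    # Imported lazily so the offline helper below works without the library installed.
    from google_play_scraper import Sort, reviews

    result, _continuation_token = reviews(
        app_id,
        lang='id',         # review language (assumed: Indonesian)
        country='id',      # store country (assumed: Indonesia)
        sort=Sort.NEWEST,  # newest reviews first
        count=count,
    )
    return result


def keep_fields(records, fields=('userName', 'score', 'at', 'content')):
    """Keep only the selected keys of each review dictionary."""
    return [{key: record.get(key) for key in fields} for record in records]


# Offline illustration with a made-up record shaped like one review entry.
sample = [{'userName': 'user_a', 'score': 5, 'at': None,
           'content': 'Aplikasinya bagus', 'thumbsUpCount': 0}]
rows = keep_fields(sample)
print(rows)
```

With network access, `keep_fields(fetch_reviews('com.example.app', count=200))` would produce a list that can be passed straight to `pd.DataFrame(...)`.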

#**Data Preparation**

In [10]:
df = pd.read_csv('https://raw.githubusercontent.com/LutfiaRahmah/PSDS-3.0/main/dataset_tweet_sentiment_pilkada_DKI_2017.csv')
df

Unnamed: 0,Id,Sentiment,Pasangan Calon,Text Tweet
0,1,negative,Agus-Sylvi,Banyak akun kloning seolah2 pendukung #agussil...
1,2,negative,Agus-Sylvi,#agussilvy bicara apa kasihan yaa...lap itu ai...
2,3,negative,Agus-Sylvi,Kalau aku sih gak nunggu hasil akhir QC tp lag...
3,4,negative,Agus-Sylvi,Kasian oh kasian dengan peluru 1milyar untuk t...
4,5,negative,Agus-Sylvi,Maaf ya pendukung #AgusSilvy..hayo dukung #Ani...
...,...,...,...,...
895,896,positive,Anies-Sandi,"Kali saja bpk @aniesbaswedan @sandiuno lihat, ..."
896,897,positive,Anies-Sandi,Kita harus dapat merangkul semua orang tanpa b...
897,898,positive,Anies-Sandi,Ini jagoanku dibidang digital <Smiling Face Wi...
898,899,positive,Anies-Sandi,#PesanBijak #OkeOce #GubernurGu3 ...


**Extracting the Text Tweet Series**

In [11]:
df_text = df['Text Tweet']
df_text

0      Banyak akun kloning seolah2 pendukung #agussil...
1      #agussilvy bicara apa kasihan yaa...lap itu ai...
2      Kalau aku sih gak nunggu hasil akhir QC tp lag...
3      Kasian oh kasian dengan peluru 1milyar untuk t...
4      Maaf ya pendukung #AgusSilvy..hayo dukung #Ani...
                             ...                        
895    Kali saja bpk @aniesbaswedan @sandiuno lihat, ...
896    Kita harus dapat merangkul semua orang tanpa b...
897    Ini jagoanku dibidang digital <Smiling Face Wi...
898                 #PesanBijak #OkeOce #GubernurGu3 ...
899    Sandiaga: Bangun Rumah DP 0% Lebih Simpel Diba...
Name: Text Tweet, Length: 900, dtype: object

#**Text Preprocessing**

One of the challenges of text data is that its form varies widely. A single word can be written in many ways, and there is a high chance of typos, stray punctuation, numbers, and so on. Therefore, before the text is converted into numeric data, it needs to be preprocessed into a cleaner, more standardized form. This strongly affects the results of any text analysis; in sentiment analysis, for example, this step is very important. The Text Preprocessing stage consists of the following steps:

**1. Case Folding & Data Cleaning**

Case folding is one of the simplest and most effective forms of text preprocessing, although it is often overlooked. Its purpose is to convert all letters in a document to lowercase. Only the letters ‘a’ to ‘z’ are kept; all other characters are removed and treated as delimiters.

The case folding stage can involve several operations, including:

*   Removing punctuation
*   Removing numbers
*   Converting text to lowercase
*   Removing extra whitespace (blank characters)

In [12]:
import re, unicodedata
def Case_Folding(text):
  # Remove non-ASCII characters
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  # Remove punctuation (every non-word character and underscore becomes a space)
  text = re.sub(r'[^\w]|_', ' ', text)
  # Remove numbers, including any token that contains a digit
  text = re.sub(r"\S*\d\S*", "", text).strip()
  text = re.sub(r"\b\d+\b", " ", text)
  # Convert text to lowercase
  text = text.lower()
  # Collapse repeated whitespace into a single space
  text = re.sub(r'[\s]+', ' ', text)
  return text
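To see what these steps do to a single string, the following self-contained sketch applies the same sequence of operations to a made-up sample. The function name `case_folding` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re
import unicodedata


def case_folding(text):
    # Drop non-ASCII characters.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Turn punctuation (non-word characters and underscores) into spaces.
    text = re.sub(r'[^\w]|_', ' ', text)
    # Remove every token that contains a digit.
    text = re.sub(r'\S*\d\S*', '', text).strip()
    # Lowercase, then collapse repeated whitespace.
    text = text.lower()
    return re.sub(r'\s+', ' ', text)


sample = 'Maaf ya pendukung #AgusSilvy..hayo dukung 2017 QC2 :)'
folded = case_folding(sample)
print(folded)  # maaf ya pendukung agussilvy hayo dukung
```

Note that tokens mixing letters and digits, such as `QC2`, are removed entirely rather than stripped of their digits.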

#**Lemmatization**

Lemmatization reduces the different inflected forms of a word to a single form to make analysis easier. For example, “swim”, “swimming”, “swims”, and “swam” are all forms of “swim”, so the lemma of all these words is “swim”.
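The “swim” example can be written as a tiny lookup-based lemmatizer. This is only a toy sketch: the table `LEMMA_TABLE` and the function `lemmatize_tokens` are hypothetical names, and real lemmatizers rely on much larger dictionaries and morphological rules.

```python
# Hypothetical lookup table: inflected form -> lemma.
LEMMA_TABLE = {'swimming': 'swim', 'swims': 'swim', 'swam': 'swim'}


def lemmatize_tokens(tokens, table=LEMMA_TABLE):
    """Map each token to its lemma; unknown tokens are only lowercased."""
    return [table.get(token.lower(), token.lower()) for token in tokens]


lemmas = lemmatize_tokens(['swim', 'Swimming', 'swims', 'swam'])
print(lemmas)  # ['swim', 'swim', 'swim', 'swim']
```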

For Indonesian text, we will use the nlp-id library. First, we need to install it.

In [13]:
!pip install nlp-id

Collecting nlp-id
  Downloading nlp_id-0.1.12.0.tar.gz (7.9 MB)
     |████████████████████████████████| 7.9 MB 1.5 MB/s 
Collecting scikit-learn==0.22
  Downloading scikit_learn-0.22-cp37-cp37m-manylinux1_x86_64.whl (7.0 MB)
     |████████████████████████████████| 7.0 MB 20.4 MB/s 
Collecting nltk==3.4.5
  Downloading nltk-3.4.5.zip (1.5 MB)
     |████████████████████████████████| 1.5 MB 42.3 MB/s 
Collecting wget==3.2
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: nlp-id, nltk, wget
  Building wheel for nlp-id (setup.py) ... done
  Created wheel for nlp-id: filename=nlp_id-0.1.12.0-py3-none-any.whl size=8074105 sha256=7384cf4306ac0b52a612ca70aa1afddc67544861d92f2203654d56e5821a96dc
  Stored in directory: /root/.cache/pip/wheels/b2/50/48/da59531125bd94f48dfe66140f41d8fd8a4f04062050375013
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.4.5-py3-none-any.whl size=1449923 

Next, we will use the Lemmatizer() class to lemmatize the text data.

In [14]:
from nlp_id.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer()

#**Stemming**

Stemming is the process of finding the root word of a word by removing all affixes from a derived word: prefixes, infixes, suffixes, and confixes (combinations of a prefix and a suffix). Stemming replaces a word with its root form in accordance with the morphological structure of proper Indonesian.
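The idea of stripping affixes can be illustrated with a deliberately naive sketch. The root list, the affix lists, and the function `naive_stem` are assumptions made for this example only. A stripped candidate is accepted only if it appears in the root list, since blind stripping would turn a root such as `makan` into `ma`. Real Indonesian stemmers also handle prefix assimilation and many more rules.

```python
# Tiny hypothetical root dictionary and affix lists (far from complete).
ROOTS = {'makan', 'main', 'baik'}
PREFIXES = ('di', 'ber', 'ter', 'per', 'ke', 'se')
SUFFIXES = ('kan', 'an', 'i', 'nya')


def naive_stem(word):
    """Return a known root obtained by stripping one prefix and/or one suffix."""
    word = word.lower()
    if word in ROOTS:
        return word
    # Candidates after removing one prefix.
    candidates = [word]
    candidates += [word[len(p):] for p in PREFIXES if word.startswith(p)]
    # Candidates after additionally removing one suffix.
    for candidate in list(candidates):
        candidates += [candidate[:-len(s)] for s in SUFFIXES if candidate.endswith(s)]
    for candidate in candidates:
        if candidate in ROOTS:
            return candidate
    return word  # no known root found: leave the word unchanged


stems = [naive_stem(w) for w in ['dimakan', 'bermain', 'perbaikan', 'makan']]
print(stems)  # ['makan', 'main', 'baik', 'makan']
```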

For Indonesian text, we will use the PySastrawi library. First, we need to install it.

In [15]:
!pip install PySastrawi

Collecting PySastrawi
  Downloading PySastrawi-1.2.0-py2.py3-none-any.whl (210 kB)
     |████████████████████████████████| 210 kB 1.9 MB/s 
Installing collected packages: PySastrawi
Successfully installed PySastrawi-1.2.0


Next, we will use StemmerFactory() to perform stemming.

In [16]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
# Create the stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

#**Slang Words**

Slang words are non-standard words that are nevertheless widely used by speakers of a language. We need to standardize slang.

In [17]:
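# NOTE: this dictionary maps Sentiment labels to Pasangan Calon values, so it is only a
# placeholder; a real slang dictionary (slang word -> standard word) should be loaded here.
df_sentiment = pd.Series(df['Pasangan Calon'].values,index=df['Sentiment']).to_dict()

In [18]:
def Slangwords(text):
  # Replace token by token so that substrings of longer words are left untouched
  words = [df_sentiment.get(word, word) for word in text.split()]
  return ' '.join(words)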
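As a sketch of what slang standardization could look like with an actual slang dictionary, the example below uses a small hand-written mapping. `SLANG_MAP` and `normalize_slang` are hypothetical names, and the entries are illustrative rather than taken from any dataset used in this notebook.

```python
# Hypothetical slang dictionary: slang form -> standard form.
SLANG_MAP = {'gak': 'tidak', 'ga': 'tidak', 'yg': 'yang', 'tp': 'tetapi', 'sy': 'saya'}


def normalize_slang(text, slang_map=SLANG_MAP):
    """Replace whole tokens found in the slang dictionary."""
    return ' '.join(slang_map.get(token, token) for token in text.split())


normalized = normalize_slang('sy gak tahu yg mana')
print(normalized)  # saya tidak tahu yang mana
```

Replacing whole tokens matters here: a plain `str.replace('ga', 'tidak')` would also rewrite the `ga` inside `harga`, whereas `normalize_slang('harga ga naik')` leaves `harga` intact.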

#**Stop Words**

Stop words are common words that usually appear in large numbers and are considered to carry little meaning. Stop-word lists are commonly used in information retrieval tasks, including by Google. Examples of English stop words include “of” and “the”, while Indonesian examples include “yang”, “di”, and “ke”.

In [19]:
from nlp_id.stopword import StopWord
stopword = StopWord()
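Independently of any library, stop-word removal amounts to filtering tokens against a list. The sketch below uses a three-word list taken from the examples mentioned above. The names `STOP_WORDS` and `remove_stop_words` are illustrative assumptions, and a real list would be far longer.

```python
# Minimal illustrative stop-word list for Indonesian.
STOP_WORDS = {'yang', 'di', 'ke'}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return ' '.join(token for token in text.split() if token not in stop_words)


filtered = remove_stop_words('saya pergi ke pasar yang ramai di kota')
print(filtered)  # saya pergi pasar ramai kota
```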

#**Unwanted Words**

Unwanted words are words that fall outside the categories above but still need to be removed. We can define for ourselves which words or characters we want to strip from the text data we have collected.

In [60]:
unwanted_words = ['jagoanku','sy', 'karna', 'gue', 'pun', 'nya', 'yg', 'gw', 'ke', 'gak', 
                 'ga', 'buat', 'selama', 'akan', 'gua', 'gw', 'gue', 'banget', 
                 'mohon', 'dii', 'kalo', 'dll', 'cuman', 'cuma', 'biar', 'kayak', 
                 'ssaja', 'sih', 'si', 'situ', 'e', 'diin', 'dua', 'untuj', 'deh', 
                 'jd', 'ku', 'lg', 'and', 'tuh', 'nih', 'mas', 'mbak', 'tau', 'iya',
                 'ya', 'lu', 'pas', 'wkwk', 'haha', 'wkwkwk', 'wkwkw', 'wow', 'akak',
                 'anjir', 'lo', 'loh', 'bang', 'kak', 'twit', 'eh', 'oh', 'yuk', 'gila',
                 'anies', 'mending', 'engenggak', 'a', 'mah', 'kali', 'silvy','sandy']

In [61]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')

def RemoveUnwantedwords(text):
  # Tokenize, then keep only the tokens that are not in unwanted_words
  word_tokens = word_tokenize(text)
  filtered_sentence = [word for word in word_tokens if word not in unwanted_words]
  return ' '.join(filtered_sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#**Applying All Steps**

In [62]:
df['content_processed'] = ''
for i, row in df.iterrows():
  content = df_text[i]
  result = Case_Folding(content)
  result = lemmatizer.lemmatize(result)
  result = stemmer.stem(result)
  result = stopword.remove_stopword(result)
  result = RemoveUnwantedwords(result)
  # Write with .at to avoid chained assignment and its SettingWithCopyWarning
  df.at[i, 'content_processed'] = result


In [63]:
df[['Text Tweet', 'content_processed']]

Unnamed: 0,Text Tweet,content_processed
0,Banyak akun kloning seolah2 pendukung #agussil...,akun kloning dukung agussilvy serang paslon an...
1,#agussilvy bicara apa kasihan yaa...lap itu ai...,agussilvy bicara kasihan yaa lap air mata wkwk...
2,Kalau aku sih gak nunggu hasil akhir QC tp lag...,nunggu hasil qc tp nunggu motif cuit sbyudhoyo...
3,Kasian oh kasian dengan peluru 1milyar untuk t...,kasi kasi peluru rw agussilvy mempan menangin ...
4,Maaf ya pendukung #AgusSilvy..hayo dukung #Ani...,maaf dukung agussilvy hayo dukung aniessandi p...
...,...,...
895,"Kali saja bpk @aniesbaswedan @sandiuno lihat, ...",bpk aniesbaswedan sandiuno lihat rspun selfie ...
896,Kita harus dapat merangkul semua orang tanpa b...,rangkul orang batas usia kelamin okeoce ok han...
897,Ini jagoanku dibidang digital <Smiling Face Wi...,jago bidang digital smiling face with sunglass...
898,#PesanBijak #OkeOce #GubernurGu3 ...,pesanbijak okeoce
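The loop above chains the preprocessing steps one after another. The same idea can be sketched as a list of functions applied in order, which makes it easy to add, remove, or reorder steps. All names below (`lowercase`, `strip_non_letters`, `drop_words`, `preprocess`) are hypothetical stand-ins for the notebook's steps, not the library functions themselves.

```python
import re


def lowercase(text):
    return text.lower()


def strip_non_letters(text):
    # Replace everything except lowercase letters and spaces, then collapse whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'[^a-z ]', ' ', text)).strip()


def drop_words(words):
    """Build a step that removes the given words."""
    banned = set(words)

    def step(text):
        return ' '.join(token for token in text.split() if token not in banned)

    return step


def preprocess(text, steps):
    """Apply each step to the text, in order."""
    for step in steps:
        text = step(text)
    return text


steps = [lowercase, strip_non_letters, drop_words({'yang', 'di', 'sih'})]
cleaned = [preprocess(t, steps)
           for t in ['Kalau aku sih GAK nunggu!!', 'Rumah yang di Jakarta...']]
print(cleaned)  # ['kalau aku gak nunggu', 'rumah jakarta']
```

With pandas, such a pipeline could be applied to a whole column with `df['Text Tweet'].apply(lambda t: preprocess(t, steps))`.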


#**Word Representation Using Matrices**

##**Bag of Words**

Bag of words is the simplest way to represent words as a matrix for further NLP processing. The basic idea is to count the occurrences of each word in a given sentence. For example, in the sentence:

    `Selamat Pagi Matematika UAD`

the words that appear are `Selamat`, `Pagi`, `Matematika`, and `UAD`.

In the binary form of this representation, 1 means a word `appears` and 0 means it `does not appear`; in the count form, each entry is the number of times the word occurs.
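A minimal sketch of this idea with only the standard library is shown below. The second sentence is made up so that the count and binary forms differ, and the variable names are illustrative.

```python
from collections import Counter

sentences = ['selamat pagi matematika uad', 'selamat pagi pagi']
tokenized = [sentence.split() for sentence in sentences]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence, one column per vocabulary word.
count_matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]
binary_matrix = [[int(count > 0) for count in row] for row in count_matrix]

print(vocabulary)     # ['matematika', 'pagi', 'selamat', 'uad']
print(count_matrix)   # [[1, 1, 1, 1], [0, 2, 1, 0]]
print(binary_matrix)  # [[1, 1, 1, 1], [0, 1, 1, 0]]
```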

Input Text DataFrame

In [109]:
dff = df['content_processed']
dff

0      akun kloning dukung agussilvy serang paslon an...
1      agussilvy bicara kasihan yaa lap air mata wkwk...
2      nunggu hasil qc tp nunggu motif cuit sbyudhoyo...
3      kasi kasi peluru rw agussilvy mempan menangin ...
4      maaf dukung agussilvy hayo dukung aniessandi p...
                             ...                        
895    bpk aniesbaswedan sandiuno lihat rspun selfie ...
896    rangkul orang batas usia kelamin okeoce ok han...
897    jago bidang digital smiling face with sunglass...
898                                    pesanbijak okeoce
899    sandiaga bangun rumah dp simpel banding tol ci...
Name: content_processed, Length: 900, dtype: object

In [110]:
preprocessed_dff = []
# Split each preprocessed tweet into a list of words
for i in dff:
    preprocessed_dff.append(i.split(' '))
print(preprocessed_dff)

[['akun', 'kloning', 'dukung', 'agussilvy', 'serang', 'paslon', 'aniessandi', 'opini', 'argumen', 'pmbenaran', 'kecoh'], ['agussilvy', 'bicara', 'kasihan', 'yaa', 'lap', 'air', 'mata', 'wkwkwkwk'], ['nunggu', 'hasil', 'qc', 'tp', 'nunggu', 'motif', 'cuit', 'sbyudhoyono', 'pasca', 'agussilvy', 'nyungsep'], ['kasi', 'kasi', 'peluru', 'rw', 'agussilvy', 'mempan', 'menangin', 'pilkada', 'quickcount'], ['maaf', 'dukung', 'agussilvy', 'hayo', 'dukung', 'aniessandi', 'putar', 'ronavioleta', 'netizentofa'], ['aneh', 'lebay', 'sangkut', 'paut', 'kandidat', 'calgub', 'lebay', 'dukung', 'agussilvy'], ['allah', 'swt', 'syukur', 'sbyudhoyono', 'presidensby', 'ahy', 'terimakasih', 'makna'], ['terima', 'kasih', 'dukung', 'ahy', 'sylvi', 'beda'], ['trima', 'kasih', 'keistiqomahan', 'rawan', 'ahy', 'agusyudhoyono', 'tlh', 'jakartagubernurbaru', 'dg', 'signifikan'], ['kenang', 'pidato', 'kalah', 'ahy'], ['dammnn', 'politik', 'cantik', 'sby', 'ngorbanin', 'ahy'], ['batal', 'nyoblos', 'nyata', 'no', 'kert

In [111]:
indeks_kata = []
import pprint
from collections import Counter

for i in preprocessed_dff:
    indeks_kata.append(Counter(i))
    
pprint.pprint(indeks_kata)

Output streaming will be truncated to the last 5000 lines.
          'sindir': 1,
          'spanduk': 1,
          'bersyariah': 1,
          'djarot': 1,
          'aniesbaswedan': 1}),
 Counter({'youtu': 1,
          'be': 1,
          'virall': 1,
          'bocah': 1,
          'baju': 1,
          'kecewa': 1,
          'wingstourinjakarta': 1,
          'freeahok': 1}),
 Counter({'turun': 1,
          'pangkat': 1,
          'ahok': 1,
          'jakarta': 1,
          'butuh': 1,
          'nista': 1,
          'agama': 1}),
 Counter({'mulut': 1,
          'ahokers': 1,
          'jorok': 1,
          'tanda': 1,
          'kuat': 1,
          'yangbikinberantem': 1}),
 Counter({'pkb': 1,
          'dukung': 1,
          'ahokpenistaagama': 1,
          'putar': 1,
          'tinggal': 1,
          'pilih': 1,
          'umat': 1,
          'islam': 1,
          'milu': 1}),
 Counter({'ahok': 1,
          'djarot': 1,
          'apbd': 1,
          'alat': 1,
    

In [112]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

In [113]:
count_vector.fit_transform(dff).todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [114]:
# On scikit-learn >= 1.0, use count_vector.get_feature_names_out() instead
count_vector.get_feature_names()

['aa',
 'aagym',
 'aamiin',
 'abaaah',
 'abal',
 'abang',
 'abdi',
 'abis',
 'absurd',
 'abunawas',
 'aburizal',
 'abused',
 'acara',
 'acu',
 'adaaqua',
 'adalahkita',
 'addiems',
 'adekku',
 'adil',
 'adjierimbawan',
 'adjrot',
 'agam',
 'agama',
 'agus',
 'agusharimurtiyudhoyono',
 'agussilvy',
 'agussylvi',
 'agusyudhoyono',
 'ah',
 'aher',
 'ahhhh',
 'ahli',
 'ahmadfuadanwar',
 'ahok',
 'ahokbtp',
 'ahokdicintairakyat',
 'ahokdjarot',
 'ahokdjarota',
 'ahokdjarotdirosi',
 'ahokdjarotlebihbaik',
 'ahokdjarotmenang',
 'ahoker',
 'ahokers',
 'ahokfornobel',
 'ahokfree',
 'ahokhebat',
 'ahokishope',
 'ahokjarot',
 'ahokkalah',
 'ahokkeok',
 'ahoklovers',
 'ahokmandikembang',
 'ahokmartir',
 'ahokmenang',
 'ahokpanikahokkalah',
 'ahokpenistaagama',
 'ahokrekormuri',
 'ahokselaludihati',
 'ahokshow',
 'ahoktakbersalah',
 'ahoktumbang',
 'ahox',
 'ahy',
 'ahya',
 'ahycenter',
 'ahyfansclub',
 'ahyforall',
 'ahyhargamati',
 'ahylovers',
 'ahymaininsara',
 'ahysylvi',
 'aib',
 'aiman',
 'a

In [115]:
df_array = count_vector.transform(dff).toarray()
df_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [116]:
indeks_matrik = pd.DataFrame(df_array,index=dff,columns=count_vector.get_feature_names())
indeks_matrik

content_processed,aa,aagym,aamiin,abaaah,abal,abang,abdi,abis,absurd,abunawas,aburizal,abused,acara,acu,adaaqua,adalahkita,addiems,adekku,adil,adjierimbawan,adjrot,agam,agama,agus,agusharimurtiyudhoyono,agussilvy,agussylvi,agusyudhoyono,ah,aher,ahhhh,ahli,ahmadfuadanwar,ahok,ahokbtp,ahokdicintairakyat,ahokdjarot,ahokdjarota,ahokdjarotdirosi,ahokdjarotlebihbaik,...,weedbegoodtogether,welcome,wiiih,wilayah,wingstourinjakarta,wirausaha,wisata,wisdom,with,without,wkkwwkkwk,wkwkwkwk,wkwkwkwkkwkk,wpap,wujud,yaa,yaaa,yaaaa,yah,yahudi,yak,yangbikinberantem,ye,yee,yep,yes,yme,ynwa,yogia,youtu,youtube,yra,yudhoyono,yusuf,zalim,zaraz,zarazettirazr,zarazettirazz,zipper,zona
akun kloning dukung agussilvy serang paslon aniessandi opini argumen pmbenaran kecoh,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
agussilvy bicara kasihan yaa lap air mata wkwkwkwk,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
nunggu hasil qc tp nunggu motif cuit sbyudhoyono pasca agussilvy nyungsep,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
kasi kasi peluru rw agussilvy mempan menangin pilkada quickcount,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
maaf dukung agussilvy hayo dukung aniessandi putar ronavioleta netizentofa,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bpk aniesbaswedan sandiuno lihat rspun selfie okeoce ok hand debatpilkadadki ahokpanikahokkalah,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
rangkul orang batas usia kelamin okeoce ok hand salambersama pks temanahok victory hand,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
jago bidang digital smiling face with sunglasses ok hand thonyleong com okeoce salambersama,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
pesanbijak okeoce,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [120]:
indeks_matrik.to_csv('Pilkada_DKI_BoW.csv', index=False)

In [122]:
!head -n 2 Pilkada_DKI_BoW.csv


#**TF-IDF**

TF-IDF is a method for representing words as numbers. It combines Term Frequency (TF) and Inverse Document Frequency (IDF).

##**Term Frequency (TF)**

Term Frequency (TF) is the number of times word i appears in sentence j divided by the total number of words in sentence j. TF measures how often a word appears in a sentence. Because sentences differ in length (number of words), a word can appear more often in a long sentence than in a short one, which is why the count is divided by the sentence length.

$$TF(\text{word})=\frac{\text{Number of times the word appears in the sentence}}{\text{Number of words in the sentence}}$$

##**Inverse Document Frequency (IDF)**

Inverse Document Frequency (IDF) measures how important a word is. When computing TF, all words are treated as equally important. However, some words appear often yet carry little information, such as `adalah`, `pada`, and `ini`. Such words occur in many sentences and therefore receive a low IDF; the higher the IDF of a word, the rarer and more informative it is. In its basic form, IDF is computed as:

$$IDF(\text{word})=\log_{e} \frac{\text{Number of sentences}}{\text{Number of sentences containing the word}}$$

A smoothed variant, which is the default in scikit-learn's TfidfVectorizer, adds 1 to both the numerator and the denominator to avoid division by zero, and adds 1 to the result:

$$IDF(\text{word})=\log_{e} \frac{1+\text{Number of sentences}}{1+\text{Number of sentences containing the word}}+1$$
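The formulas above can be checked by hand on a tiny corpus. The sketch below is an illustration under stated assumptions: the three token lists are made up, and the function names are hypothetical. The results will not match `TfidfVectorizer` exactly, because that class uses raw counts for TF and L2-normalizes each row by default.

```python
import math
from collections import Counter

# Made-up corpus: each sentence is already tokenized.
docs = [
    ['ahok', 'menang'],
    ['ahok', 'kalah'],
    ['anies', 'menang', 'menang'],
]


def term_frequency(doc):
    """TF = occurrences of the word / number of words in the sentence."""
    counts = Counter(doc)
    return {word: n / len(doc) for word, n in counts.items()}


def inverse_document_frequency(docs, smooth=True):
    """IDF per word, in the plain or the smoothed form given above."""
    n_docs = len(docs)
    # Number of sentences that contain each word at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    if smooth:
        return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


idf = inverse_document_frequency(docs)
tfidf = [{w: tf * idf[w] for w, tf in term_frequency(doc).items()} for doc in docs]

for row in tfidf:
    print({word: round(value, 3) for word, value in row.items()})
```

Here `kalah` appears in one sentence and `ahok` in two, so `kalah` receives the higher IDF, in line with the interpretation that rarer words are more informative.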

Input Text DataFrame

In [151]:
from sklearn.feature_extraction.text import TfidfVectorizer
vektor_tfidf = TfidfVectorizer()

In [152]:
respon = vektor_tfidf.fit_transform(dff)
print(respon)

  (0, 1079)	0.3599785997985998
  (0, 1794)	0.3599785997985998
  (0, 177)	0.3599785997985998
  (0, 1619)	0.3599785997985998
  (0, 139)	0.14072503942312334
  (0, 1683)	0.2409346974377567
  (0, 2075)	0.31358922656390953
  (0, 25)	0.24600165790359294
  (0, 586)	0.1771673883102274
  (0, 1138)	0.3599785997985998
  (0, 89)	0.2897941874782868
  (1, 2498)	0.4111339554604735
  (1, 1368)	0.33869690110811224
  (1, 73)	0.3710549131714833
  (1, 1238)	0.4111339554604735
  (1, 2502)	0.33869690110811224
  (1, 1058)	0.33869690110811224
  (1, 319)	0.31807330115348853
  (1, 25)	0.28096013129759334
  (2, 1595)	0.28873803714518453
  (2, 1681)	0.30168209621717795
  (2, 2023)	0.23930672486165191
  (2, 477)	0.2786978449109964
  (2, 1454)	0.31992572282294757
  (2, 2355)	0.31992572282294757
  :	:
  (897, 2306)	0.38006819008878084
  (897, 534)	0.38006819008878084
  (897, 320)	0.358394965180424
  (897, 1988)	0.2760539420842905
  (897, 1607)	0.20007696915047873
  (897, 936)	0.33108993100747464
  (897, 2185)	0.30596

In [155]:
vektor_tfidf.get_feature_names()

['aa',
 'aagym',
 'aamiin',
 'abaaah',
 'abal',
 'abang',
 'abdi',
 'abis',
 'absurd',
 'abunawas',
 'aburizal',
 'abused',
 'acara',
 'acu',
 'adaaqua',
 'adalahkita',
 'addiems',
 'adekku',
 'adil',
 'adjierimbawan',
 'adjrot',
 'agam',
 'agama',
 'agus',
 'agusharimurtiyudhoyono',
 'agussilvy',
 'agussylvi',
 'agusyudhoyono',
 'ah',
 'aher',
 'ahhhh',
 'ahli',
 'ahmadfuadanwar',
 'ahok',
 'ahokbtp',
 'ahokdicintairakyat',
 'ahokdjarot',
 'ahokdjarota',
 'ahokdjarotdirosi',
 'ahokdjarotlebihbaik',
 'ahokdjarotmenang',
 'ahoker',
 'ahokers',
 'ahokfornobel',
 'ahokfree',
 'ahokhebat',
 'ahokishope',
 'ahokjarot',
 'ahokkalah',
 'ahokkeok',
 'ahoklovers',
 'ahokmandikembang',
 'ahokmartir',
 'ahokmenang',
 'ahokpanikahokkalah',
 'ahokpenistaagama',
 'ahokrekormuri',
 'ahokselaludihati',
 'ahokshow',
 'ahoktakbersalah',
 'ahoktumbang',
 'ahox',
 'ahy',
 'ahya',
 'ahycenter',
 'ahyfansclub',
 'ahyforall',
 'ahyhargamati',
 'ahylovers',
 'ahymaininsara',
 'ahysylvi',
 'aib',
 'aiman',
 'a

In [156]:
respon.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [160]:
df_tfidf = pd.DataFrame(respon.todense().T, index=vektor_tfidf.get_feature_names(), columns=[f'A{i+1}' for i in range(len(dff))])
df_tfidf

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20,A21,A22,A23,A24,A25,A26,A27,A28,A29,A30,A31,A32,A33,A34,A35,A36,A37,A38,A39,A40,...,A861,A862,A863,A864,A865,A866,A867,A868,A869,A870,A871,A872,A873,A874,A875,A876,A877,A878,A879,A880,A881,A882,A883,A884,A885,A886,A887,A888,A889,A890,A891,A892,A893,A894,A895,A896,A897,A898,A899,A900
aa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.345315,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.27472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aagym,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aamiin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abaaah,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zaraz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zarazettirazr,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zarazettirazz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.377723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zipper,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [161]:
df_tfidf.to_csv('Pilkada_DKI_TFIDF.csv')  # keep the index: it holds the terms

In [162]:
!ls

Pilkada_DKI_BoW.csv  Pilkada_DKI_TFIDF.csv  sample_data


In [163]:
!head -n 2 Pilkada_DKI_TFIDF.csv
