# NLP Part 2

In this workbook you will learn how to build a sentiment analysis pipeline using unsupervised machine learning. The text analytics workflow consists of the following steps:
1. Data preprocessing
2. Modelling
3. Model evaluation

## Connect Gdrive to Colab
Before you start, make sure your Google Colab session is connected to your Google Drive.


In [1]:
# Mount Google Drive in Google Colaboratory
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Import Package

In [2]:
!pip3 install unidecode

Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.4.0


In [3]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [4]:
# Import Package

import os
import re
import multiprocessing
import pandas as pd
import numpy as np
from time import time
from unidecode import unidecode

import nltk
from nltk.util import ngrams

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from IPython.display import display

In [5]:
# Download Corpus
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
# Define the path and check that the data is there
path = r'/content/gdrive/MyDrive/Semester 7/NLP/'


os.listdir(path)

['V2_Text_analytics.ipynb - Alfin.ipynb',
 'drive-download-20251030T073426Z-1-001.zip',
 'drive-download-20251030T073426Z-1-001',
 'archive.zip',
 'Dataset_Macet',
 'Dataset_Macet.csv',
 'result_word2vec.model']

## Load Data
The workbook was written for text collected from Twitter, where only a text column is available and the goal is to find out the sentiment of Twitter users. The cell below loads `Dataset_Macet.csv`; the following step keeps only its last column as the text.

In [7]:
# Load dataset
# df = pd.read_csv(os.path.join(path, 'twitter_dataset.csv'), header=None)
df_ = pd.read_csv(os.path.join(path, 'Dataset_Macet.csv'), encoding = "ISO-8859-1", header=None)
df_.head()

Unnamed: 0,0,1,2
0,index,1st Road Class,1st Road Class Desc
1,0,1,Motorway
2,1,2,A(M)
3,2,3,A
4,3,4,B


In [8]:
# Keep only the last column and rename it to 'text'
df = df_.iloc[ :, -1:]
df.columns = ['text']
df.head()

Unnamed: 0,text
0,1st Road Class Desc
1,Motorway
2,A(M)
3,A
4,B


You have now renamed the column and have the text data whose sentiment you will analyze. Before that, the data is cleaned to make sure there are no duplicated or missing rows.

## Preprocessing
Inspect the data first: the number of rows and columns, and the data type of each column.

In [9]:
# Show DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    67 non-null     object
dtypes: object(1)
memory usage: 748.0+ bytes


The data has 1 column and 77 rows, of which 67 are non-null. Next, check whether any values are missing.

In [10]:
# Check for missing values
df.isnull().sum()

Unnamed: 0,0
text,10


The check shows 10 missing values in the `text` column. Next, check whether any rows are duplicated.

In [11]:
# Check for duplicated rows
df[df.duplicated()]

Unnamed: 0,text
14,
23,
34,
39,
44,
48,
49,
50,
51,
59,[Not used]


The duplicate check finds several duplicated rows; the rows shown are mostly repeated missing values, plus a repeated `[Not used]` entry. These duplicates are removed next.

In [12]:
# Drop duplicated rows
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
df.shape

(65, 1)

With the duplicates removed (the one remaining missing row is dropped just before the cleaning is applied), the next step is to clean each value in the text column. The cleaning concept is the same as in the previous material, although some steps are not used here. The cleaning process is as follows.

In [14]:
# Stopword vocabulary
stops = set(nltk.corpus.stopwords.words("english"))

In [15]:
# Data preprocessing

# HTML tags and links (http and https)
html_tag = re.compile(r'<.*?>')
http_link = re.compile(r'https?://\S+')
www_link = re.compile(r'www\.\S+')

# User mentions, including underscores (applied after lowercasing)
user_name = re.compile(r'@\w+')

# Unneeded punctuation
punctuation = re.compile(r'[^\w\s]')

# Function that cleans one text value and returns a list of tokens
def data_cleaning(text, stopwords = False):
  # convert unicode characters to their closest ASCII form
  text = unidecode(text)

  # lowercase
  text = text.lower()

  # remove html tags
  text = re.sub(html_tag, r'', text)

  # remove urls
  text = re.sub(http_link, r'', text)
  text = re.sub(www_link, r'', text)

  # remove user names
  text = re.sub(user_name, r'', text)

  # remove punctuation
  text = re.sub(punctuation, r'', text)

  # tokenize
  text = text.split()

  # remove stopwords
  if stopwords:
    text = [w for w in text if w not in stops]
  return text
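
As a quick check of what each regular expression removes, the following stdlib-only sketch applies similar patterns to a made-up tweet. The sample string, the helper name `toy_clean`, and the tiny stopword set are assumptions for illustration; the workbook itself uses `unidecode` and NLTK's English stopword list.

```python
import re

# Patterns mirroring the cleaning steps: HTML tags, URLs, @mentions, punctuation.
HTML_TAG = re.compile(r'<.*?>')
URL = re.compile(r'https?://\S+|www\.\S+')
MENTION = re.compile(r'@\w+')
PUNCTUATION = re.compile(r'[^\w\s]')

# Hypothetical mini stopword list; the workbook uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "on", "a"}

def toy_clean(text, remove_stopwords=True):
    """Lowercase, strip markup/links/mentions/punctuation, then tokenize."""
    text = text.lower()
    for pattern in (HTML_TAG, URL, MENTION, PUNCTUATION):
        text = pattern.sub('', text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return tokens

sample = "<b>Traffic</b> on the M1 is terrible! @road_news https://example.com/x"
print(toy_clean(sample))
print(toy_clean("A(M)", remove_stopwords=False))
```

The second call shows why a value such as `A(M)` can end up as an empty list: after lowercasing and punctuation removal it becomes the single token `am`, which NLTK's English stopword list then removes.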

In [16]:
# Apply the cleaning to the dataset
df = df.dropna(subset=['text']).reset_index(drop=True)
df['tweet_clean'] = df['text'].apply(lambda x: data_cleaning(x, stopwords=True))
df.head()

Unnamed: 0,text,tweet_clean
0,1st Road Class Desc,"[1st, road, class, desc]"
1,Motorway,[motorway]
2,A(M),[]
3,A,[]
4,B,[b]


In [17]:
df.shape

(64, 2)

In [18]:
# Keep only rows that contain at least 2 tokens
data_clean = df[df.tweet_clean.str.len() > 1]
data_clean.shape

(45, 2)

The number of rows changes after filtering on the number of tokens per row: from 64 to 45.

In [19]:
# Check the current state of the data
data_clean.head()

Unnamed: 0,text,tweet_clean
0,1st Road Class Desc,"[1st, road, class, desc]"
7,Road Surface Desc,"[road, surface, desc]"
9,Wet / Damp,"[wet, damp]"
11,Frost / Ice,"[frost, ice]"
12,Flood (surface water over 3cm deep),"[flood, surface, water, 3cm, deep]"


## N-Gram
An n-gram is a sequence of n consecutive tokens. N-gram models help capture context for words that should not be separated. Before moving on to the next step, the n-gram concept is introduced first.

In [20]:
# N-gram concept
# Demonstrated on the first row only
unigram = []
bigram = []
trigram = []

for words in data_clean["tweet_clean"][:1]:
  list_bigram = list(ngrams(words, 2))
  list_trigram = list(ngrams(words, 3))
  for word in words:
    unigram.append(word)
  for word in list_bigram:
    bigram.append(word)
  for word in list_trigram:
    trigram.append(word)

print("Sentence : ", data_clean["tweet_clean"][:1])
print("Unigram : ", unigram)
print("Bigram : ", bigram)
print("Trigram : ", trigram)

Sentence :  0    [1st, road, class, desc]
Name: tweet_clean, dtype: object
Unigram :  ['1st', 'road', 'class', 'desc']
Bigram :  [('1st', 'road'), ('road', 'class'), ('class', 'desc')]
Trigram :  [('1st', 'road', 'class'), ('road', 'class', 'desc')]


The output above shows how the grouping of words differs between unigrams, bigrams, and trigrams. N-gram models are widely used to predict the next word given the preceding text.
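
For reference, n-grams can also be built without NLTK. The sketch below is a minimal pure-Python version; the helper name `make_ngrams` is an assumption for illustration and is not part of the workbook.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['1st', 'road', 'class', 'desc']
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with fewer than n tokens simply yields an empty list, which is the same behavior you get from `nltk.util.ngrams` when no padding is requested.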

In [21]:
help(Phrases)

Help on class Phrases in module gensim.models.phrases:

class Phrases(_PhrasesTransformation)
 |  Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', connector_words=frozenset())
 |
 |  Detect phrases based on collocation counts.
 |
 |  Method resolution order:
 |      Phrases
 |      _PhrasesTransformation
 |      gensim.interfaces.TransformationABC
 |      gensim.utils.SaveLoad
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_', progress_per=10000, scoring='default', connector_words=frozenset())
 |      Parameters
 |      ----------
 |      sentences : iterable of list of str, optional
 |          The `sentences` iterable can be simply a list, but for larger corpora, consider a generator that streams
 |          the sentences directly from disk/network, See :class:`~gensim.models.word2vec.BrownCorpu

In [22]:
# Train a phrase (bigram) model on the current dataset and apply it
sent = [row for row in data_clean.tweet_clean]
phrases = Phrases(sent, min_count=3, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]
sentences[1]

['road', 'surface', 'desc']

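This cell uses the bigram concept: when the model recognizes two words that form a single unit, it automatically joins them into one token (with `_` as the delimiter). The following example applies the trained model to a new sentence; in this run the output is unchanged, because the model, trained on this small corpus, has not learned `united states` as a phrase.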

In [23]:
example = 'commonly known as the united states'
example = example.split()
bigram[example]

['commonly', 'known', 'as', 'the', 'united', 'states']
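
To see why a pair is or is not joined, it helps to look at the score behind the decision. The sketch below follows the default scoring described in the gensim documentation, `(bigram_count - min_count) / worda_count / wordb_count * len_vocab`, with a pair joined only when the score exceeds `threshold`. The function name `phrase_score` and all counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Phrase score in the style of gensim's default scorer (Mikolov et al.)."""
    if count_a == 0 or count_b == 0:
        # A word that never appeared in training cannot be part of a phrase.
        return 0.0
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 10.0  # same default as Phrases(threshold=10.0)

# Made-up counts: a pair that co-occurs often relative to its parts...
frequent = phrase_score(count_a=40, count_b=50, count_ab=35, vocab_size=1000, min_count=5)
# ...and a pair whose words were never seen during training.
unseen = phrase_score(count_a=0, count_b=0, count_ab=0, vocab_size=1000, min_count=5)

print(frequent, frequent > threshold)
print(unseen, unseen > threshold)
```

With a corpus of only a few dozen short rows, very few pairs reach `min_count`, so most sentences pass through the model unchanged.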

## POS TAGGING

In [24]:
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [25]:
# Demonstrated on a single sentence
pos_tag = []

text = ["European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices."]

# Download the specific English tagger resource
nltk.download('averaged_perceptron_tagger_eng')

for i in text:
  words = i.split()
  words = nltk.pos_tag(words)
  for j in words:
    pos_tag.append(j)

pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$5.1', 'NN'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices.', 'NN')]

Labels such as NN, JJ, and NNS are part-of-speech tags that indicate the grammatical category of each word. To look up what a tag means, you can run:

In [26]:
# Look up the meaning of a tag in the Penn Treebank tagset
nltk.download('tagsets_json')
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


[nltk_data] Downloading package tagsets_json to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets_json.zip.
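
Once tokens are tagged, a common next step is to keep only certain word classes. The sketch below filters a few `(token, tag)` pairs copied from the tagging output above; the helper name `words_with_tag_prefix` is an assumption for illustration.

```python
# A few (token, tag) pairs copied from the POS tagging output above
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('mobile', 'JJ'), ('phone', 'NN'), ('market', 'NN')]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Keep tokens whose Penn Treebank tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, 'NN')       # NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(tagged, 'JJ')  # JJ, JJR, JJS
print(nouns)
print(adjectives)
```

Matching on the prefix works because related Penn Treebank tags share their first letters, for example all noun tags begin with `NN`.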


## NER (Named Entity Recognition)

In [27]:
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [28]:
nltk.download('maxent_ne_chunker_tab')
chunks = nltk.ne_chunk(pos_tag, binary = True)
for chunk in chunks:
  print(chunk)

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


(NE European/JJ)
('authorities', 'NNS')
('fined', 'VBD')
('Google', 'NNP')
('a', 'DT')
('record', 'NN')
('$5.1', 'NN')
('billion', 'CD')
('on', 'IN')
('Wednesday', 'NNP')
('for', 'IN')
('abusing', 'VBG')
('its', 'PRP$')
('power', 'NN')
('in', 'IN')
('the', 'DT')
('mobile', 'JJ')
('phone', 'NN')
('market', 'NN')
('and', 'CC')
('ordered', 'VBD')
('the', 'DT')
('company', 'NN')
('to', 'TO')
('alter', 'VB')
('its', 'PRP$')
('practices.', 'NN')


In [29]:
entities = []
labels = []
for chunk in chunks:
  if hasattr(chunk, 'label'):
    entities.append(" ".join(c[0] for c in chunk))
    labels.append(chunk.label())

entities_labels = list(set(zip(entities, labels)))
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["entities", "label"]
entities_df

Unnamed: 0,entities,label
0,European,NE


## Word2Vec
Word2vec is an embedding technique that maps each word to a numeric vector. These vectors can be used to measure how close or similar words are to each other.
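
Similarity between word vectors is usually measured with cosine similarity. The sketch below computes it by hand on hypothetical 3-dimensional vectors; the words and numbers are invented for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not taken from the trained model
toy_vectors = {
    'rain': [0.9, 0.1, 0.0],
    'wet':  [0.8, 0.2, 0.1],
    'car':  [0.0, 0.2, 0.9],
}

sim_rain_wet = cosine_similarity(toy_vectors['rain'], toy_vectors['wet'])
sim_rain_car = cosine_similarity(toy_vectors['rain'], toy_vectors['car'])
print(sim_rain_wet, sim_rain_car)
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0.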

In [30]:
# Set Parameter
min_count = 3
window = 4
size = 250
sample = 1e-5
negative = 20
workers = multiprocessing.cpu_count()-1

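`min_count` ignores words whose total frequency in the corpus is lower than this value (here, words that appear fewer than 3 times).<br>
`window` is the maximum distance between the current word and the words used as its context: up to 4 words to the left and up to 4 words to the right.<br>
`size` is the dimensionality of the word vectors (the size of the hidden layer); it is passed to gensim 4 as `vector_size`.<br>
`sample` is the threshold that controls how strongly very frequent words are randomly downsampled.<br>
`negative` is the number of negative samples ("noise words") drawn for each training example.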
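
To make the `window` parameter concrete, the sketch below lists the (center, context) pairs that fall inside a fixed window. It is a simplification under stated assumptions: the helper `context_pairs` is hypothetical, and gensim may additionally vary the effective window size during training, which is ignored here.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['flood', 'surface', 'water', '3cm', 'deep']
narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=4)

for pair in narrow:
    print(pair)
print(len(narrow), len(wide))
```

With `window=4` every token in this five-word sentence sees all the others, so short texts like the ones in this dataset give the model very little positional information.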

In [31]:
# --- Import library ---
import multiprocessing
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import pandas as pd
from time import time
import os

# --- 1. Set Parameter ---
min_count = 3
window = 4
size = 250
sample = 1e-5
negative = 20
workers = multiprocessing.cpu_count() - 1

# --- 2. Read the dataset ---
# Change the file name to match your dataset
# Use the path defined earlier and the correct filename
df = pd.read_csv(os.path.join(path, 'Dataset_Macet.csv'), encoding = "ISO-8859-1", header=None)

# Check the column names to find the text column
print(df.columns)

# For example the text column may be named 'text'; adjust to your own column
# Assuming the text column is the last one based on previous cells
text_column = df.columns[-1]

# --- 3. Preprocessing: convert each text into a list of tokens ---
# Drop rows with NaN in the text column before preprocessing
sentences = [simple_preprocess(str(doc)) for doc in df[text_column].dropna()]

# --- 4. Create the Word2Vec model ---
w2v_model = Word2Vec(min_count=min_count,
                     window=window,
                     vector_size=size,
                     sample=sample,
                     negative=negative,
                     workers=workers)

start = time()

# --- 5. Build the vocabulary from the tokenized data ---
w2v_model.build_vocab(sentences, progress_per=50000)

print('Time to build vocab : {} minutes'.format(round((time() - start) / 60, 2)))

Index([0, 1, 2], dtype='int64')
Time to build vocab : 0.0 minutes


In [32]:
# Train the Word2Vec model
start = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model : {} minutes'.format(round((time() - start) / 60, 2)))

# init_sims(replace=True) is deprecated in gensim 4; normalize the vectors to unit length directly
w2v_model.wv.unit_normalize_all()

Time to train the model : 0.0 minutes


In [34]:
# Save the model
model_name = "result_word2vec.model"
w2v_model.save(os.path.join(path, model_name))

In [33]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Make sure tweet_clean contains lists of tokens
# Example: [['i', 'like', 'this', 'product'], ['the', 'service', 'is', 'very', 'good']]
sentences = data_clean['tweet_clean']

# 1. Build the bigram model from the tokenized data
bigram_model = Phrases(sentences, min_count=5, threshold=10)

# 2. Wrap it in a Phraser for efficiency
bigram = Phraser(bigram_model)

# 3. Copy the dataset and keep the original text for comparison
data_final = data_clean.copy()
data_final['old_text'] = data_final['tweet_clean'].apply(lambda x: ' '.join(x))

# 4. Apply the bigram model to each token list
data_final['tweet_clean'] = data_final['tweet_clean'].apply(lambda x: bigram[x])

# 5. Join the tokens back into a single string
data_final['tweet_clean'] = data_final['tweet_clean'].apply(lambda x: ' '.join(x))

# 6. Check the result
data_final.head()


Unnamed: 0,text,tweet_clean,old_text
0,1st Road Class Desc,1st road class desc,1st road class desc
7,Road Surface Desc,road surface desc,road surface desc
9,Wet / Damp,wet damp,wet damp
11,Frost / Ice,frost ice,frost ice
12,Flood (surface water over 3cm deep),flood surface water 3cm deep,flood surface water 3cm deep


## Modelling
To assign a label to each text, you will use clustering. As an example, the word vectors are divided into 2 clusters, positive and negative, using k-means.

In [35]:
# Build the model with 2 clusters
word_vectors = w2v_model.wv
model = KMeans(n_clusters=2, max_iter=1000, random_state=1, n_init=50).fit(X=word_vectors.vectors.astype('double'))

In [36]:
word_vectors.similar_by_vector(model.cluster_centers_[1], topn=10, restrict_vocab=None)

[('goods', 0.44150635600090027),
 ('without', 0.41356420516967773),
 ('present', 0.39526110887527466),
 ('other', 0.3683597445487976),
 ('motorcycle', 0.3606860935688019),
 ('casualty', 0.34424981474876404),
 ('unknown', 0.3292577564716339),
 ('lights', 0.3057979643344879),
 ('high', 0.2934713065624237),
 ('over', 0.2924194037914276)]

In [37]:
word_vectors.similar_by_vector(model.cluster_centers_[0], topn=10, restrict_vocab=None)

[('daylight', 0.33813467621803284),
 ('cc', 0.33416762948036194),
 ('under', 0.3296198844909668),
 ('tonnes', 0.3143388628959656),
 ('with', 0.2965700030326843),
 ('vehicle', 0.2930358052253723),
 ('and', 0.2622346878051758),
 ('passenger', 0.2550296485424042),
 ('winds', 0.25372835993766785),
 ('or', 0.24502070248126984)]

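The workbook treats cluster 1 as the positive cluster and cluster 0 as the negative one. Note that for this dataset the words nearest to either centroid (for example `goods`, `motorcycle`, `daylight`, `vehicle`) carry no obvious sentiment, so the assignment here is a convention for following the remaining steps rather than something the output demonstrates.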

In [38]:
# Define each cluster
positive_cluster_index = 1
positive_cluster = model.cluster_centers_[positive_cluster_index]
negative_cluster = model.cluster_centers_[1-positive_cluster_index]


In [39]:
# Build a DataFrame with one row per vocabulary word
words = pd.DataFrame(word_vectors.index_to_key)
words.columns = ['words']

# Add the embedding vector of each word
words['vectors'] = words.words.apply(lambda x: word_vectors[x])
words

Unnamed: 0,words,vectors
0,vehicle,"[-0.0059118867, 0.0026324117, 0.056702625, 0.1..."
1,desc,"[0.015663661, -0.028841611, -0.07710454, -0.08..."
2,cc,"[-0.094255455, 0.023005415, -0.009434141, -0.1..."
3,and,"[-0.09628914, -0.016054614, 0.10517717, -0.083..."
4,street,"[0.0785661, -0.01737758, 0.0880832, -0.1051640..."
5,winds,"[0.07160828, -0.07334372, 0.010303873, -0.0645..."
6,high,"[-0.07413948, 0.04364161, 0.022304857, 0.07671..."
7,over,"[0.079334825, 0.107139215, 0.09521731, -0.0430..."
8,or,"[0.029646926, 0.009764689, -0.029292377, 0.107..."
9,lighting,"[0.099635124, -0.049979143, -0.07680682, 0.081..."


In [40]:
# Add a cluster column with the predicted sentiment cluster of each word
words['cluster'] = words.vectors.apply(lambda x: model.predict([np.array(x)]))
words.cluster = words.cluster.apply(lambda x: x[0])
words.head()

Unnamed: 0,words,vectors,cluster
0,vehicle,"[-0.0059118867, 0.0026324117, 0.056702625, 0.1...",0
1,desc,"[0.015663661, -0.028841611, -0.07710454, -0.08...",0
2,cc,"[-0.094255455, 0.023005415, -0.009434141, -0.1...",0
3,and,"[-0.09628914, -0.016054614, 0.10517717, -0.083...",0
4,street,"[0.0785661, -0.01737758, 0.0880832, -0.1051640...",1


In [41]:
# Mapping cluster
words['cluster_value'] = [1 if i==positive_cluster_index else -1 for i in words.cluster]
words['closeness_score'] = words.apply(lambda x: 1/(model.transform([x.vectors]).min()), axis=1)
words['sentiment_coeff'] = words.closeness_score * words.cluster_value
words.head()

Unnamed: 0,words,vectors,cluster,cluster_value,closeness_score,sentiment_coeff
0,vehicle,"[-0.0059118867, 0.0026324117, 0.056702625, 0.1...",0,-1,1.045333,-1.045333
1,desc,"[0.015663661, -0.028841611, -0.07710454, -0.08...",0,-1,1.011823,-1.011823
2,cc,"[-0.094255455, 0.023005415, -0.009434141, -0.1...",0,-1,1.057822,-1.057822
3,and,"[-0.09628914, -0.016054614, 0.10517717, -0.083...",0,-1,1.036264,-1.036264
4,street,"[0.0785661, -0.01737758, 0.0880832, -0.1051640...",1,1,1.011036,1.011036


You now have a sentiment coefficient for every word in the model's vocabulary, along with the cleaned text. Next, build a dictionary that maps each word to its sentiment coefficient.

In [42]:
# The dictionary uses the word as the key and its sentiment coefficient as the value
sentiment_dict = dict(zip(words.words.values, words.sentiment_coeff.values))

In [43]:
tfidf = TfidfVectorizer(tokenizer=lambda y: y.split(), norm=None)
tfidf.fit(data_final.tweet_clean)

# Get the feature names
features = pd.Series(tfidf.get_feature_names_out())

# Transform the text into a TF-IDF matrix
transformed = tfidf.transform(data_final.tweet_clean)
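
With `norm=None` and scikit-learn's default `smooth_idf=True`, the TF-IDF weight of a word that occurs once in a document is just its smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`. The sketch below computes it by hand; the helper name `smooth_idf` is an assumption, and 45 is the number of rows left in `data_clean`.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """Inverse document frequency with add-one smoothing, plus 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 45  # rows left in data_clean after filtering

idf_one_doc = smooth_idf(n_documents, 1)   # word that appears in a single row
idf_two_docs = smooth_idf(n_documents, 2)  # word that appears in two rows
print(idf_one_doc, idf_two_docs)
```

The two results are approximately 4.1355 and 3.7300, which are the values that appear in the `tfidf_scores` column of the results table further down.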




In [44]:
# Functions to replace each word in a text with its TF-IDF score
def create_tfidf_dictionary(transformed_file, features):
    vector_coo = transformed_file.tocoo()
    # Correctly create a dictionary mapping feature names to TF-IDF scores
    tfidf_dict = {}
    for col, data in zip(vector_coo.col, vector_coo.data):
        word = features.iloc[col]
        tfidf_dict[word] = data
    return tfidf_dict

def replace_tfidf_words(x, transformed_file, features):
    dictionary = create_tfidf_dictionary(transformed_file, features)
    # The input x is a single string sentence, split it into words
    return [dictionary.get(word, 0) for word in x.split()]

In [45]:
start = time()

# Apply the function to each sentence in the 'tweet_clean' column
replaced_tfidf_scores = []
for i in range(len(data_final)):
    tfidf_dict = create_tfidf_dictionary(transformed[i], features)
    replaced_tfidf_scores.append([tfidf_dict.get(word, 0) for word in data_final.iloc[i]['tweet_clean'].split()])


print('Time to compute TF-IDF scores : {} minutes'.format(round((time() - start) / 60, 2)))

Time to compute TF-IDF scores : 0.0 minutes


In [46]:
# Function to replace a word with its sentiment coefficient

def replace_sentiment_words(word, sentiment_dict):
    try:
        out = sentiment_dict[word]
    except KeyError:
        out = 0
    return out

In [47]:
# Replace each word with its sentiment coefficient
replaced_closeness_scores = data_final.tweet_clean.apply(lambda x: list(map(lambda y: replace_sentiment_words(y, sentiment_dict), x.split())))
replaced_closeness_scores[:5]

Unnamed: 0,tweet_clean
0,"[0, 0, 0, -1.0118227654917693]"
7,"[0, 0, -1.0118227654917693]"
9,"[0, 0]"
11,"[0, 0]"
12,"[0, 0, 0, 0, 0]"


In [48]:
# Build a DataFrame of the sentiment labeling results
results = pd.DataFrame({
    'sentiment_coeff': replaced_closeness_scores,
    'tfidf_scores': replaced_tfidf_scores,
    'sentence': data_final.tweet_clean
})

# Sentiment rate = dot product of sentiment coefficients and TF-IDF scores
results['sentiment_rate'] = results.apply(lambda x: np.array(x.loc['sentiment_coeff']).dot(np.array(x.loc['tfidf_scores'])), axis=1)
results['prediction'] = (results.sentiment_rate>0).astype('int8')
results.head()

Unnamed: 0,sentiment_coeff,tfidf_scores,sentence,sentiment_rate,prediction
0,"[0, 0, 0, -1.0118227654917693]","[4.13549421592915, 3.7300291078209855, 3.73002...",1st road class desc,-2.662527,0
7,"[0, 0, -1.0118227654917693]","[3.7300291078209855, 3.7300291078209855, 2.631...",road surface desc,-2.662527,0
9,"[0, 0]","[4.13549421592915, 4.13549421592915]",wet damp,0.0,0
11,"[0, 0]","[4.13549421592915, 4.13549421592915]",frost ice,0.0,0
12,"[0, 0, 0, 0, 0]","[4.13549421592915, 3.7300291078209855, 4.13549...",flood surface water 3cm deep,0.0,0
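
The `sentiment_rate` column is a plain dot product, and the prediction is a threshold at zero. The sketch below redoes the calculation by hand with rounded values loosely based on the `road surface desc` row; the helper names are assumptions for illustration.

```python
def sentiment_rate(sentiment_coeffs, tfidf_scores):
    """Dot product of per-word sentiment coefficients and TF-IDF weights."""
    return sum(s * t for s, t in zip(sentiment_coeffs, tfidf_scores))

def predict_label(rate):
    """1 for a strictly positive rate, otherwise 0 (a rate of exactly 0 also maps to 0)."""
    return 1 if rate > 0 else 0

# Rounded values loosely based on the 'road surface desc' row above
coeffs = [0.0, 0.0, -1.0118]
weights = [3.7300, 3.7300, 2.6314]
rate = sentiment_rate(coeffs, weights)
print(round(rate, 4), predict_label(rate))

# A sentence whose words are all missing from the sentiment dictionary
empty_rate = sentiment_rate([0, 0], [4.1355, 4.1355])
print(empty_rate, predict_label(empty_rate))
```

Because the comparison is `> 0`, sentences made up only of words outside the Word2Vec vocabulary get a rate of 0 and therefore the label 0, which is why rows such as `wet damp` are predicted as 0.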


In [49]:
# Simplify the output
results = results[['sentence', 'prediction']]
results.head()

Unnamed: 0,sentence,prediction
0,1st road class desc,0
7,road surface desc,0
9,wet damp,0
11,frost ice,0
12,flood surface water 3cm deep,0
