
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
rizkia14_nlp_bahasa_resources_path = kagglehub.dataset_download('rizkia14/nlp-bahasa-resources')
rizkia14_pln_mobile_path = kagglehub.dataset_download('rizkia14/pln-mobile')
rizkia14_kamus_sentimen_path = kagglehub.dataset_download('rizkia14/kamus-sentimen')
rizkia14_movie_review_path = kagglehub.dataset_download('rizkia14/movie-review')

print('Data source import complete.')


## Introduction

![](https://static1.squarespace.com/static/5daddb33ee92bf44231c2fef/t/5f0387d2f6724b5987a29311/1594066902743/natural%2Blanguage%2Bprocessing%2Bin%2Bhealthcare%2B-%2Bforesee%2Bmedical.gif?format=1500w)
![](https://miro.medium.com/proxy/1*_JW1JaMpK_fVGld8pd1_JQ.gif)


Sentiment analysis is the process of identifying, extracting, and evaluating the opinions, feelings, or sentiment expressed in text such as customer reviews, social media posts, or news articles. Its main goal is to determine whether a text is positive, negative, or neutral, and how strongly that sentiment is expressed.

Common approaches to sentiment analysis include:
* Rule-based approaches: use hand-written rules or keyword lists to score the sentiment of a text.
* Statistical approaches: use machine learning classifiers, such as logistic regression, to classify text.
* Deep learning: use artificial neural networks for more complex sentiment analysis.

Data for sentiment analysis can come from many sources, such as customer reviews, social media, surveys, or interviews. Each source has its own characteristics and challenges.

## Unsupervised Lexical Based Models

In sentiment analysis, unsupervised lexicon-based models identify and analyze the sentiment of a text without requiring sentiment labels or annotations on training data. They rely on lexical information and linguistic features found in the text to decide whether its sentiment is positive, negative, or neutral.

These models typically use a lexicon, also called a dictionary or vocabulary, built specifically for sentiment analysis. A lexicon lists words associated with positive and negative sentiment and may also record polarity (the magnitude of the negative or positive value), part-of-speech (POS) tags, subjectivity classes (strong, weak, neutral), mood, and so on. To score a text document, we match its words against the lexicon and then take other factors into account, such as negation, surrounding words, overall context, and phrases, before aggregating the polarity values into a final sentiment score.


Each word in the text is given a weight that reflects how strongly positive or negative it is according to the sentiment dictionary. The weight is usually a numeric score.

The model then looks for words or phrases that signal sentiment. For example, if a text contains positive words such as "baik" (good), "senang" (happy), and "puas" (satisfied) alongside negative words such as "buruk" (bad) and "kecewa" (disappointed), the model derives the sentiment from the balance between them.

Finally, the model computes an overall sentiment score from the word weights and the detected patterns. A positive score indicates positive sentiment, a negative score indicates negative sentiment, and a score of zero indicates neutral text with no detectable sentiment.
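
The weighting and aggregation described above can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_LEXICON`, `TOY_NEGATIONS`, and their weights are made-up values rather than entries from the dictionary files used later, and negation is handled by the crude rule of flipping the sign of the word that directly follows a negation word.

```python
# Hypothetical mini-lexicon: word -> polarity weight (illustrative values only).
TOY_LEXICON = {"baik": 3, "senang": 4, "puas": 3, "buruk": -4, "kecewa": -5}
# Hypothetical negation words.
TOY_NEGATIONS = {"tidak", "bukan", "kurang"}


def weighted_score(tokens, lexicon=TOY_LEXICON, negations=TOY_NEGATIONS):
    """Sum the polarity weights of the tokens, flipping the sign after a negation word."""
    score = 0
    for i, token in enumerate(tokens):
        weight = lexicon.get(token, 0)  # words outside the lexicon contribute 0
        if i > 0 and tokens[i - 1] in negations:
            weight = -weight
        score += weight
    return score


def score_to_label(score):
    """Map an aggregate score to a sentiment label."""
    if score > 0:
        return "positif"
    if score < 0:
        return "negatif"
    return "netral"


for text in ["aplikasi baik dan saya senang", "saya tidak puas dan kecewa"]:
    score = weighted_score(text.split())
    print(text, "->", score, score_to_label(score))
```

Real lexicon-based scorers usually go further, for example by handling multi-word phrases and intensifiers, which this sketch leaves out.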

The simplest way to use a lexicon is to count the positive and negative words in a sentence and return 1 (positive), 0 (neutral), or -1 (negative):

$$
\text{StSc}(x) = \begin{cases}
1 & \text{positive words} > \text{negative words} \\
-1 & \text{positive words} < \text{negative words} \\
0 & \text{otherwise}
\end{cases}
$$

In [None]:
# Load a lexicon file: one entry per line, stripped and lower-cased
def loadLexicon(file):
    with open(file, "r", encoding="utf-8", errors="replace") as f:
        return [line.strip().lower() for line in f]

In [None]:
fpos = "/kaggle/input/kamus-sentimen/s-pos.txt"
fneg = "/kaggle/input/kamus-sentimen/s-neg.txt"
fnegasi = "/kaggle/input/kamus-sentimen/negasi.txt"

In [None]:
positif, negatif, negasi = loadLexicon(fpos), loadLexicon(fneg), loadLexicon(fnegasi)
print(positif[:10])
print(negatif[:10])
print(negasi[:10])

In [None]:
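def prediksiSentiment(kalimat, positif, negatif, negasi):
    # The lexicons are lower-cased when loaded, so lower-case the input as well
    kalimat = kalimat.lower()
    # Note: `w in kalimat` is a substring test on the raw string, not a token match
    posWords = []
    negWords = [w for w in negatif if w in kalimat]
    for w in positif:
        if w in kalimat:
            # A positive word preceded by a negation word counts as negative
            negated = False
            for n in negasi:
                if n+' '+w in kalimat:
                    negWords.append(n+' '+w)
                    negated = True
                    break
            if not negated:
                posWords.append(w)
    nPos, nNeg = len(posWords), len(negWords)
    # Return 1 (positive), -1 (negative), or 0 (neutral or tie)
    if nPos > nNeg:
        return 1
    elif nPos < nNeg:
        return -1
    else:
        return 0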

In [None]:
Teks = "mie ayam ini enak"
prediksiSentiment(Teks, positif, negatif, negasi)

In [None]:
Teks = "dia tidak hadir di pelatihan"
prediksiSentiment(Teks, positif, negatif, negasi)

However, this method cannot handle statements like the following, where the speaker complains and then takes it back as a joke:

In [None]:
Teks = "Ujiannya sangat sulit, aku tidak bisa mengerjakannya. Tapi bohong. Berchandyaaa"
prediksiSentiment(Teks, positif, negatif, negasi)
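
Apart from jokes and sarcasm, the substring test `w in kalimat` used by `prediksiSentiment` also fires inside longer words: "enak" (tasty) is a substring of "kenakalan" (mischief). A token-based variant avoids this. The sketch below is one possible version, not the notebook's method: the three `TOY_*` sets are hypothetical stand-ins for the lexicon files, and treating a negated negative word as positive is a design choice made here for symmetry.

```python
import re

# Hypothetical mini-lexicons for illustration; the notebook loads the real ones from files.
TOY_POSITIVE = {"enak", "hadir", "bagus"}
TOY_NEGATIVE = {"sulit", "buruk"}
TOY_NEGATION = {"tidak", "bukan"}


def predict_sentiment_tokens(sentence, positive=TOY_POSITIVE,
                             negative=TOY_NEGATIVE, negation=TOY_NEGATION):
    """Count lexicon hits on whole tokens; a negation word flips the token after it."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    n_pos = n_neg = 0
    for i, token in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in negation
        if token in positive:
            hit = -1 if negated else 1
        elif token in negative:
            hit = 1 if negated else -1
        else:
            continue  # token is not in either lexicon
        if hit > 0:
            n_pos += 1
        else:
            n_neg += 1
    # 1 if positives dominate, -1 if negatives dominate, 0 on a tie
    return (n_pos > n_neg) - (n_pos < n_neg)


for text in ["mie ayam ini enak", "dia tidak hadir di pelatihan", "kenakalan remaja"]:
    print(text, "->", predict_sentiment_tokens(text))
```

Whole-token matching fixes the false substring hits, but it still cannot recognize a retraction such as "Tapi bohong", which is the kind of case that motivates a contextual model.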

## IndoBERT

IndoBERT is a language model adapted specifically for Indonesian. It is a version of BERT (Bidirectional Encoder Representations from Transformers) trained on Indonesian text. It is useful for text analytics tasks such as text understanding, sentiment classification, and language modeling because it produces rich, contextual vector representations of text.

BERT is a landmark language model in natural language processing (NLP), developed by Google AI in 2018. Its key points are:

* Language model: BERT is a deep, transformer-based language model. It captures context and the relationships between words in a text better than earlier models.

* Bidirectional: BERT processes text from both directions at once, so it interprets each word using the full sentence rather than only the words to its left or to its right. This gives a better grasp of what a word means in context.

* Pre-training and fine-tuning: BERT is first pre-trained on a very large text corpus, so it learns many aspects of language. The pre-trained model can then be fine-tuned for specific NLP tasks such as text classification or text understanding.

* State of the art: When it was introduced, BERT achieved very strong results across a range of NLP tasks, and it has since served as the basis for many later developments. BERT-based models became a de facto standard for many NLP tasks.

* Open source: Google released BERT as open source, so researchers and developers worldwide can use it and fine-tune it for their own needs.
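
To make the "bidirectional" point concrete, the sketch below compares which positions each token may look at in a left-to-right model and in a BERT-style encoder. It is a simplified picture: the function names and the example sentence are illustrative, and padding masks and the pre-training objective are left out.

```python
def causal_mask(n):
    """Left-to-right model: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style encoder: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["saya", "suka", "aplikasi", "ini"]
masks = {
    "causal": causal_mask(len(tokens)),
    "bidirectional": bidirectional_mask(len(tokens)),
}
for name, mask in masks.items():
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, m in zip(tokens, row) if m]
        print(f"  {token:>8} sees {visible}")
```

In the causal case the first token sees only itself, while in the bidirectional case it sees the whole sentence, which is what lets the representation of a word depend on words that come after it.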

## Implementation of IndoBERT

This exercise analyzes the sentiment of user reviews of the PLN Mobile app. The code follows the reference by [YosefOwenM-0905](https://github.com/YosefOwenM-0905/Implementation-Of-Indobert-On-UserReviewsOfThePlnMobile-Application-BasedOnIndonesianLanguageLexicon) and is used here for learning purposes.

In [None]:
import pandas as pd
import numpy as np
review = pd.read_csv('/kaggle/input/pln-mobile/review-pln-mobile.csv', sep=",")
review

### Text Pre-processing

In [None]:
pip install nlp-id

In [None]:
import pandas as pd
import string
import re
import json
from nlp_id.tokenizer import Tokenizer
from nlp_id.stopword import StopWord
from nlp_id.lemmatizer import Lemmatizer

In [None]:
# Load the slang-to-formal word dictionary
with open('/kaggle/input/nlp-bahasa-resources/combined_slang_words.txt') as f:
    data0 = f.read()
print("Data type before reconstruction : ", type(data0))
formal_indo = json.loads(data0)
print("Data type after reconstruction : ", type(formal_indo))

In [None]:
def informal_to_formal_indo(text):
    res = " ".join(formal_indo.get(ele, ele) for ele in text.split())
    return(res)
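
The normalization step is a plain dictionary lookup per whitespace-separated token. The sketch below shows the same idea on a tiny hypothetical mapping (`toy_formal_indo` and its entries are made up for illustration; the real mapping is the one loaded from `combined_slang_words.txt`).

```python
# Hypothetical slang-to-formal entries, for illustration only.
toy_formal_indo = {"gak": "tidak", "bgt": "banget", "apk": "aplikasi", "yg": "yang"}


def normalize_slang(text, mapping):
    """Replace each whitespace-separated token that has an entry in the mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())


print(normalize_slang("apk yg ini gak bisa dibuka", toy_formal_indo))
# The lookup is case-sensitive, so "Gak" is left unchanged unless the text is lower-cased first.
print(normalize_slang("Gak bisa login", toy_formal_indo))
```

Because the lookup is case-sensitive, the text should be lower-cased before normalization, as `my_tokenizer` does.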

In [None]:
tokenizer = Tokenizer()
stopword = StopWord()
lemmatizer = Lemmatizer()

In [None]:
def my_tokenizer(doc):
    # Remove mentions, hashtags, retweet markers, URLs, and digits
    doc = re.sub(r'@[A-Za-z0-9]+', '', doc)
    doc = re.sub(r'#[A-Za-z0-9]+', '', doc)
    doc = re.sub(r'RT[\s]', '', doc)
    doc = re.sub(r"http\S+", '', doc)
    doc = re.sub(r'[0-9]+', '', doc)
    # Squeeze runs of a repeated character down to two
    doc = re.sub(r'(.)\1+',r'\1\1', doc)
    # Collapse repeated sentence punctuation
    doc = re.sub(r'[\?\.\!]+(?=[\?.\!])', '',doc)
    # Replace every non-letter with a space
    doc = re.sub(r'[^a-zA-Z]',' ', doc)
    # Collapse immediately repeated words
    doc = re.sub(r'\b(\w+)( \1\b)+', r'\1', doc)
    doc = doc.replace('\n', ' ')
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    doc = doc.strip(' ')
    # Lowercase
    doc = doc.lower()
    # Text normalization (slang to formal words)
    doc = informal_to_formal_indo(doc)
    # Remove punctuation and digits
    doc = doc.translate(str.maketrans('', '', string.punctuation + string.digits))
    # Strip surrounding whitespace
    doc = doc.strip()
    # Tokenization
    tokens = tokenizer.tokenize(doc)
    # Stopword removal (fetch the stopword list once, not once per token)
    stopwords = set(stopword.get_stopword())
    tokens = [word for word in tokens if word not in stopwords]
    # Lemmatization
    return [lemmatizer.lemmatize(word) for word in tokens]
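
To see what the regular expressions in `my_tokenizer` do before `nlp-id` gets involved, the sketch below applies a subset of them to one sample string. The function `clean_text`, the sample review, and the final whitespace-collapsing step are additions made for this illustration; slang normalization, stopword removal, and lemmatization are left out.

```python
import re


def clean_text(text):
    """Apply the regex-only part of the pipeline (no nlp-id, no slang dictionary)."""
    text = re.sub(r"@[A-Za-z0-9]+", "", text)          # drop @mentions
    text = re.sub(r"#[A-Za-z0-9]+", "", text)          # drop #hashtags
    text = re.sub(r"http\S+", "", text)                # drop URLs
    text = re.sub(r"(.)\1+", r"\1\1", text)            # squeeze repeated characters to two
    text = re.sub(r"[^a-zA-Z]", " ", text)             # keep letters only
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)     # collapse immediately repeated words
    text = re.sub(r"\s+", " ", text).strip()           # collapse whitespace (added for readability)
    return text.lower()


sample = "@pln Aplikasinya baguuuus bangettt!!! cek cek https://contoh.id #mantap 123"
print(clean_text(sample))  # aplikasinya baguus bangett cek
```

Note that elongated words such as "baguuuus" are only shortened to "baguus", not restored to "bagus"; the pipeline relies on the slang dictionary and later steps to map such forms to standard words where it can.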

In [None]:
#text  pre-processing
review['preprocessing'] = review['ulasan'].apply(my_tokenizer)
review[['ulasan', 'preprocessing']]

In [None]:
review1=review[['ulasan', 'preprocessing']]

In [None]:
!pip install sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Build the stemmer once; creating it again for every review is slow
stemmer = StemmerFactory().create_stemmer()

def stemming(ulasan):
    # Stem each token and join the result back into a single string
    return " ".join(stemmer.stem(w) for w in ulasan)

review['stemming_ulasan'] = review['preprocessing'].apply(stemming)
review[['stemming_ulasan']]

### Labeling with Inset Lexicon

In [None]:
# Load the positive lexicon as {word: weight}; the first entry wins for duplicated words
lexicon_positive = pd.read_excel('/kaggle/input/kamus-sentimen/kamus_positif.xlsx')
lexicon_positive_dict = {}
for index, row in lexicon_positive.iterrows():
    if row.iloc[0] not in lexicon_positive_dict:
        lexicon_positive_dict[row.iloc[0]] = row.iloc[1]

# Load the negative lexicon the same way
lexicon_negative = pd.read_excel('/kaggle/input/kamus-sentimen/kamus_negatif.xlsx')
lexicon_negative_dict = {}
for index, row in lexicon_negative.iterrows():
    if row.iloc[0] not in lexicon_negative_dict:
        lexicon_negative_dict[row.iloc[0]] = row.iloc[1]

def sentiment_analysis_lexicon_indonesia(ulasan):
    # Sum the lexicon weights of every token found in either lexicon
    score = 0
    for word in ulasan:
        if word in lexicon_positive_dict:
            score = score + lexicon_positive_dict[word]
    for word in ulasan:
        if word in lexicon_negative_dict:
            score = score + lexicon_negative_dict[word]
    # Map the total score to a label
    if score > 0:
        sentimen = 'positif'
    elif score < 0:
        sentimen = 'negatif'
    else:
        sentimen = 'netral'
    return score, sentimen

results = review['preprocessing'].apply(sentiment_analysis_lexicon_indonesia)
# results[0] holds the scores, results[1] holds the sentiment labels
results = list(zip(*results))
review['label'] = results[1]
dataSentimen = review
data_inset = review

data_inset[['ulasan', 'preprocessing', 'label']]

In [None]:
data = review[['stemming_ulasan', 'label']].copy()  # copy so later edits do not touch `review`
data

In [None]:
data['label'].value_counts()

In [None]:
data['label'].value_counts().plot.pie(autopct='%.2f')

In [None]:
# Encode the labels as integers: negatif -> 0, positif -> 1, netral -> 2.
# Mapping only the label column leaves the review text untouched.
data = data.assign(label=data['label'].map({'negatif': 0, 'positif': 1, 'netral': 2}))
data.head()

### Split Data

In [None]:
from sklearn.model_selection import train_test_split

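# 80% training data; the remaining 20% is split evenly into validation and test sets.
# random_state is fixed here so the split is reproducible.
df_train, df_holdout = train_test_split(data, test_size=0.2, random_state=42)
df_val, df_test = train_test_split(df_holdout, test_size=0.5, random_state=42)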
print('Training data shape:', df_train.shape)
print('Validation data shape:', df_val.shape)
print('Test data shape:', df_test.shape)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5, 5))
sns.countplot(x=df_train['label'])
plt.show()

In [None]:
df_train.to_csv('data_training.csv', index = False)

In [None]:
data = pd.read_csv('data_training.csv')
data.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5, 5))
sns.countplot(x=df_val['label'])
plt.show()

In [None]:
df_val.to_csv('data_validasi.csv', index = False)

In [None]:
data = pd.read_csv('data_validasi.csv')
data.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(5, 5))
sns.countplot(x=df_test['label'])
plt.show()

In [None]:
df_test.to_csv('data_testing.csv', index = False)

In [None]:
data = pd.read_csv('data_testing.csv')
data.head()

### Indobert Model

In [None]:
#Modelling
!pip install transformers

In [None]:
from transformers import BertTokenizer

# Load tokenizer dari pre-trained model
bert_tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p2')

In [None]:
# View vocabulary from pre-trained models that have been preloaded
vocabulary = bert_tokenizer.get_vocab()
print('Vocabulary size:', len(vocabulary))

In [None]:
print(vocabulary)

In [None]:
# Example of Tokenization
# Retrieve the 1st index data on the dataframe
print('Sentence:', review['stemming_ulasan'][0])
print('BERT Tokenizer:', bert_tokenizer.tokenize(review['stemming_ulasan'][0]))

In [None]:
# Example of input formatting for BERT.
# Input formatting can use 'encode_plus' function
bert_input = bert_tokenizer.encode_plus(
    # Sample sentences
    review['stemming_ulasan'][0],
    # Add [CLS] token at the beginning of the sentence & [SEP] token at the end of the sentence
    add_special_tokens = True,
    # Pad with [PAD] tokens up to max_length when the sentence is shorter
    padding = 'max_length',
    # Truncate if sentence is more than max_length
    truncation = 'longest_first',
    # Determine the max_length of the entire sentence
    max_length = 50,
    # Returns the attention mask value
    return_attention_mask = True,
    # Returns the value of token type id (segment embedding)
    return_token_type_ids =True)
# The function 'encode_plus' returns 3 values:
# input_ids, token_type_ids, attention_mask
bert_input.keys()

In [None]:
# Original data
print('Sentence\t:', review['stemming_ulasan'][0])  # index 0 selects the first review; change it to inspect another row
# Input formatting + tokenizer return
print('Tokenizer\t:', bert_tokenizer.convert_ids_to_tokens(bert_input['input_ids']))
# Input IDs: token indexes in the tokenizer vocabulary
print('Input IDs\t:', bert_input['input_ids'])
# Token type IDs: the segment (sentence) each token belongs to (segment embedding)
print('Token Type IDs\t:', bert_input['token_type_ids'])
# Attention mask: 1 marks a real token the model attends to,
# 0 marks a padding token that is ignored
print('Attention Mask\t:', bert_input['attention_mask'])
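
The three returned lists are easier to read once the padding logic is written out by hand. The sketch below mimics what `encode_plus` produces for a single sentence. It is an approximation for illustration only: `format_input` is a made-up helper, and the special-token IDs are hypothetical placeholders, since the real values come from the tokenizer's vocabulary.

```python
# Hypothetical special-token IDs, chosen for illustration only.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def format_input(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad   # 0 = padding, ignored by attention
    token_type_ids = [0] * max_length         # single-sentence input: one segment
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = format_input([7, 8, 9], max_length=8)
print(encoded["input_ids"])       # [101, 7, 8, 9, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

All three lists always have length `max_length`, which is what allows the encoded sentences to be stacked into fixed-shape tensors.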

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# There are many ways to choose max_length.
# The intuition: avoid truncating sentences
# while also avoiding excessive padding (which slows computation)

# In this example, max_length is determined from the distribution of tokens in the dataset
token_lens = []
for txt in review['stemming_ulasan']:
  tokens = bert_tokenizer.encode(txt)
  token_lens.append(len(tokens))
sns.histplot(token_lens, kde=True, stat='density', linewidth=0)
plt.xlim([0, 100]);
plt.xlabel('Token count');

In [None]:
# Create a function to combine tokenization steps
# Added special tokens for all data as input formatting to the BERT model
def convert_example_to_feature(sentence):
  return bert_tokenizer.encode_plus(
      sentence,
      add_special_tokens=True,
      padding='max_length',
      truncation='longest_first',
      max_length=42,
      return_attention_mask=True,
      return_token_type_ids=True)

In [None]:
# Create a function to map input formatting results to match the BERT model
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,               # input to the token embedding
      "token_type_ids": token_type_ids,     # input to the segment embedding
      "attention_mask": attention_masks,    # tells the model which positions to attend to
  }, label

In [None]:
import tensorflow as tf
# Create a function to iterate or encode each sentence in the entire data
def encode(data):
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []

  for sentence, label in data.to_numpy():
    bert_input = convert_example_to_feature(sentence)
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label])
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

In [None]:
# Perform input formatting using the previous function on the data as a whole
train_encoded = encode(df_train).batch(32)
test_encoded = encode(df_test).batch(32)
val_encoded = encode(df_val).batch(32)

In [None]:
from transformers import TFBertForSequenceClassification

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(
    'indobenchmark/indobert-base-p2', num_labels=3)

In [None]:
# Compile model
bert_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.00003),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

In [None]:
%%time
# train_encoded and val_encoded are already batched, so batch_size is not passed to fit()
bert_history = bert_model.fit(train_encoded, epochs=5,
                              validation_data=val_encoded)

In [None]:
# Create a function for plotting training results
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel('Epochs')
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

In [None]:
plot_graphs(bert_history, 'accuracy')
plot_graphs(bert_history, 'loss')

In [None]:
print('\nEpoch No.  Train Accuracy  Train Loss      Val Accuracy    Val Loss')
for i in range(len(bert_history.history['loss'])):
  print('{:8d} {:10f} \t {:10f} \t {:10f} \t {:10f}'.format(i + 1, bert_history.history['accuracy'][i],
                                                            bert_history.history['loss'][i],
                                                            bert_history.history['val_accuracy'][i],
                                                            bert_history.history['val_loss'][i]))


In [None]:
bert_model.save_weights('bert-model.h5')

In [None]:
%%time
score = bert_model.evaluate(test_encoded)
print("Test Accuracy:", score[1])

In [None]:
predicted_raw = bert_model.predict(test_encoded)

In [None]:
y_pred = np.argmax(predicted_raw['logits'], axis=1)
y_true = np.array(df_test['label'])

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

accuracy_score(y_true, y_pred)

In [None]:
confusion_matrix(y_true, y_pred)

In [None]:
print(classification_report(y_true, y_pred))

In [None]:
# Load fine-tuning results
bert_load_model = TFBertForSequenceClassification.from_pretrained(
    'indobenchmark/indobert-base-p2', num_labels=3)
bert_load_model.load_weights('bert-model.h5')

In [None]:
# Sample text
input_text = 'tolong dong diupgrade untuk di tambah riwayat penggunaan listrik harian supaya kita bisa lebih mudah lagi mengontrol penggunaan listrik setiap harinya terimakasih'

# Encode input text with the same max_length used for training
input_text_tokenized = bert_tokenizer.encode(input_text,
                                             truncation=True,
                                             padding='max_length',
                                             max_length=42,
                                             return_tensors='tf')

In [None]:
# Make predictions
bert_predict = bert_load_model(input_text_tokenized)
# Softmax function to get classification results
bert_output = tf.nn.softmax(bert_predict[0], axis=-1)

In [None]:
# Index order must match the training encoding: negatif -> 0, positif -> 1, netral -> 2
sentiment_labels = ['negatif', 'positif', 'netral']
label = tf.argmax(bert_output, axis=1)
label = label.numpy()

In [None]:
print(input_text, ':', sentiment_labels[label[0]])
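
The last two steps, softmax and the lookup from class index to label, can be checked without TensorFlow. In the sketch below the logits are hypothetical numbers, and `ID_TO_LABEL` mirrors the integer encoding applied to the `label` column before training (negatif 0, positif 1, netral 2); a list in any other order would silently mislabel predictions.

```python
import math

# Index order mirrors the training encoding: negatif -> 0, positif -> 1, netral -> 2.
ID_TO_LABEL = ["negatif", "positif", "netral"]


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-1.2, 0.4, 2.3]  # hypothetical model output for one sentence
probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(ID_TO_LABEL[best], [round(p, 3) for p in probs])
```

Softmax does not change which class has the highest score, so `argmax` over the logits gives the same label; the probabilities are useful mainly for reporting confidence.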

### Confusion Matrix

In [None]:
import seaborn as sn
from pandas import DataFrame
confm = confusion_matrix(y_true, y_pred)
columns = ['negatif','positif','netral']
df_cm = DataFrame(confm, index=columns, columns=columns)
ax = sn.heatmap(df_cm, cmap='Blues', annot=True)
ax.set_title('Confusion matrix')
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
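
The per-class precision and recall shown by `classification_report` can be read directly off the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below shows the arithmetic on made-up counts: `toy_cm` is hypothetical and is not the result of the model trained above.

```python
LABELS = ["negatif", "positif", "netral"]

# Hypothetical counts: rows = true label, columns = predicted label.
toy_cm = [
    [50, 5, 5],
    [4, 80, 6],
    [6, 10, 34],
]


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} computed from a square confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
        actual_k = sum(cm[k])                     # row sum: everything that truly is k
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


for label, (precision, recall) in per_class_metrics(toy_cm, LABELS).items():
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}")

accuracy = sum(toy_cm[k][k] for k in range(len(LABELS))) / sum(map(sum, toy_cm))
print("accuracy =", accuracy)
```

In this toy matrix the neutral class has the lowest recall, a pattern worth checking for in the real matrix because overall accuracy can hide a weak class.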

## WordCloud

In [None]:
df_GROUPBY_label = data_inset.groupby("label", sort=False)
df_GROUPBY_label.get_group('positif')
datagroup = df_GROUPBY_label[['preprocessing','label']].get_group('positif')
datagroup.to_csv('positif.csv')

In [None]:
positif = pd.read_csv('positif.csv')
%matplotlib inline

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
allWords = ' '.join([twts for twts in  positif['preprocessing']])
wordCloud = WordCloud(colormap="viridis",background_color='white',
                       width=800, height=800, random_state=10, max_font_size=200, min_font_size=20).generate(allWords)

plt.figure( figsize=(10,5), facecolor='k', frameon=False)
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
df_GROUPBY_label = data_inset.groupby("label", sort=False)
df_GROUPBY_label.get_group('negatif')
datagroup = df_GROUPBY_label[['preprocessing','label']].get_group('negatif')
datagroup.to_csv('negatif.csv')

In [None]:
negatif = pd.read_csv('negatif.csv')
%matplotlib inline

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
allWords = ' '.join([twts for twts in  negatif['preprocessing']])
wordCloud = WordCloud(colormap="viridis", background_color='white',
                       width=800, height=800, random_state=10, max_font_size=200, min_font_size=20).generate(allWords)

plt.figure( figsize=(10,5), facecolor='k', frameon=False)
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
df_GROUPBY_label = data_inset.groupby("label", sort=False)
df_GROUPBY_label.get_group('netral')
datagroup = df_GROUPBY_label[['preprocessing','label']].get_group('netral')
datagroup.to_csv('netral.csv')

In [None]:
netral = pd.read_csv('netral.csv')
%matplotlib inline

import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
allWords = ' '.join([twts for twts in  netral['preprocessing']])
wordCloud = WordCloud(colormap="viridis", background_color='white',
                       width=800, height=800, random_state=200, max_font_size=200, min_font_size=20).generate(allWords)

plt.figure( figsize=(10,5), facecolor='k', frameon=False)
plt.imshow(wordCloud, interpolation="bilinear")
plt.axis('off')
plt.show()

Next, we analyze the sentiment of movie reviews using several English-language lexicons.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('sentiwordnet')
nltk.download('wordnet')

In [None]:
import pandas as pd
import numpy as np
import nltk
import textblob
from sklearn.metrics import confusion_matrix, classification_report
np.set_printoptions(precision=2, linewidth=80)

In [None]:
pip install text_normalizer

In [None]:
import text_normalizer as tn

In [None]:
dataset = pd.read_csv('/kaggle/input/movie-review/movie_reviews.csv.bz2', compression='bz2')
dataset.head()

In [None]:
# extract data for model evaluation
reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])
test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]
sample_review_ids = [7626, 3533, 13010]

## Textblob Lexicon

In [None]:
# The loop variable is named `text` so it does not overwrite the `review` DataFrame
for text, sentiment in zip(test_reviews[sample_review_ids], test_sentiments[sample_review_ids]):
    print('REVIEW:', text)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', textblob.TextBlob(text).sentiment.polarity)
    print('-'*60)

In [None]:
# Predict sentiment for test dataset
sentiment_polarity = [textblob.TextBlob(review).sentiment.polarity for review in test_reviews]

In [None]:
predicted_sentiments = ['positive' if score >= 0.1 else 'negative' for score in sentiment_polarity]
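
TextBlob's polarity is a number between -1 and 1, so the cutoff of 0.1 used above is a choice that affects the results. The sketch below shows how the cutoff changes accuracy on a handful of made-up scores; `polarity_to_label`, `accuracy_at`, and the toy data are illustrative and are not taken from the movie-review dataset.

```python
def polarity_to_label(score, threshold=0.1):
    """Label a polarity score as positive when it reaches the threshold."""
    return "positive" if score >= threshold else "negative"


def accuracy_at(scores, truths, threshold):
    """Share of examples whose thresholded label matches the true label."""
    predictions = [polarity_to_label(s, threshold) for s in scores]
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


# Hypothetical polarity scores and true labels.
toy_scores = [0.45, 0.12, 0.05, -0.30, 0.08, 0.20]
toy_truths = ["positive", "positive", "negative", "negative", "positive", "negative"]

for threshold in (0.0, 0.1, 0.2):
    print(threshold, round(accuracy_at(toy_scores, toy_truths, threshold), 3))
```

If the cutoff is tuned rather than fixed in advance, it is safer to tune it on a held-out portion of the data than on the test reviews used for the final report.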

In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, predicted_sentiments))
pd.DataFrame(confusion_matrix(test_sentiments, predicted_sentiments), index=labels, columns=labels)

In [None]:
Text = test_reviews[sample_review_ids]
Real_sentiments = test_sentiments[sample_review_ids]
TextBlob_sentiment_polarity = [sentiment_polarity[i] for i in sample_review_ids]
TextBlob_predicted_sentiments = [predicted_sentiments[i] for i in sample_review_ids]

In [None]:
TextBlob_sample_report = {'Text':Text,'Real_sentiments':Real_sentiments,'TextBlob_sentiment_polarity':TextBlob_sentiment_polarity,
                          'TextBlob_predicted_sentiments':TextBlob_predicted_sentiments}
TextBlob_sample_report = pd.DataFrame(data=TextBlob_sample_report)

In [None]:
TextBlob_sample_report

In [None]:
def color_negative_red(value):
  """
  Return a CSS color rule for a sentiment label:
  green for 'positive', red for any other value.
  """

  if value == 'positive':
    color = 'green'
  else:
    color = 'red'

  return 'color: %s' % color

In [None]:
df = TextBlob_sample_report.copy()

In [None]:
df.style.applymap(color_negative_red, subset=['Real_sentiments','TextBlob_predicted_sentiments'])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(test_sentiments, predicted_sentiments),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("Sentiment Analysis with TextBlob",fontsize=14)
plt.show()


In [1]:
from sklearn.metrics import accuracy_score
TextBlob_model = accuracy_score(test_sentiments, predicted_sentiments)
print(TextBlob_model)
