# Final Exam Project: Sentiment Analysis using BERT (Movie Review)
Dominique - 202000216

In the digital era, online movie review systems have changed how audiences interact with film content. Understanding the sentiment expressed in reviews is very important for filmmakers, critics, and film fans. This project aims to build a sentiment analysis application for movie reviews. By examining the nuances of language, it seeks to uncover the sentiments, opinions, and reactions that viewers express toward a wide range of films. The application uses the BERT transformer architecture.

Sentiment analysis with BERT is a discriminative task. A discriminative model learns the boundaries between the classes or categories in the data. In sentiment analysis, the goal is to determine the sentiment (positive, negative, or neutral) of a given text; the dataset used in this project has only two classes, positive and negative. BERT, a pre-trained language representation model, is fine-tuned on labeled sentiment data to improve its ability to distinguish between sentiment classes. The model learns to map input text to a sentiment class based on the features and patterns it learned during training.
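In formula form, the discriminative view can be sketched as follows. The symbols are notation introduced here for illustration and do not appear in the notebook: $x$ is the review text, $h(x)$ is the pooled BERT representation of the input, $W$ and $b$ are the weights of the classification layer, and $z$ is the vector of logits with one entry per sentiment class.

$$
z = W\,h(x) + b, \qquad
p(y = k \mid x) = \frac{\exp(z_k)}{\sum_{j} \exp(z_j)}, \qquad
\hat{y} = \arg\max_{k} \; p(y = k \mid x)
$$

The model only describes how likely each label is given the text; it does not model how the text itself is generated, which is what makes it discriminative.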

In [35]:
import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import pandas as pd
from bs4 import BeautifulSoup
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 1. Loading dataset
The data is the IMDb movie review dataset prepared by Stanford for binary sentiment classification. It contains 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available.

In [2]:
current_folder = os.getcwd()

dataset = tf.keras.utils.get_file(
    fname ="aclImdb.tar.gz",
    origin ="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=  current_folder,
    extract = True)

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
dataset_path = os.path.dirname(dataset)

# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')

Load the movie reviews together with their sentiment labels and convert them into a pandas DataFrame.

0: Negative, 1: Positive

In [7]:
def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)

    return pd.DataFrame.from_dict(data)

In [None]:
# Train dataframe
train_dir = os.path.join(dataset_dir,'train')
train_df = load_dataset(train_dir)

# Test dataframe
test_dir = os.path.join(dataset_dir,'test')
test_df = load_dataset(test_dir)

### 2. Preprocessing
The loaded data is cleaned with a text cleaning step to remove noise and make sure it is ready for training the model.

In [10]:
# Plot the number of negative and positive reviews in the training set
sentiment_counts = train_df['sentiment'].value_counts()
# Map the numeric labels to readable names, in the same order as the counts
sentiment_names = sentiment_counts.index.map({0: 'Negative', 1: 'Positive'})

fig = px.bar(x=sentiment_names,
             y=sentiment_counts.values,
             color=sentiment_names,
             color_discrete_sequence=px.colors.qualitative.Dark24,
             title='Sentiments Counts')

fig.update_layout(title='Sentiments Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')

fig.show()
pyo.plot(fig, filename = 'Sentiments Counts.html', auto_open = False)

'Sentiments Counts.html'

Text Cleaning

In [11]:
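def text_cleaning(text):
    # Strip HTML markup (the reviews contain tags such as <br />)
    soup = BeautifulSoup(text, "html.parser")
    # Remove anything enclosed in square brackets
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    # Keep only letters, digits, whitespace, commas, and apostrophes
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text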
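To see what the three cleaning steps do to a single review, here is a minimal standard-library sketch. It is an illustration only: `_TextExtractor` and `text_cleaning_sketch` are hypothetical names, and the small `HTMLParser` subclass stands in for `BeautifulSoup(...).get_text()` so that the example runs without extra packages. The two regular expressions are the same as in the notebook.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (stand-in for soup.get_text())."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_cleaning_sketch(text):
    # Step 1: drop HTML tags and keep the text between them
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    text = "".join(extractor.parts)
    # Step 2: remove anything enclosed in square brackets
    text = re.sub(r"\[[^]]*\]", "", text)
    # Step 3: keep only letters, digits, whitespace, commas, and apostrophes
    return re.sub(r"[^a-zA-Z0-9\s,']", "", text)


# Made-up review used only for this demonstration
sample = "I <b>loved</b> it!<br />[spoiler] Don't miss it; 10/10 :)"
cleaned = text_cleaning_sketch(sample)
print(repr(cleaned))
```

Note that the last step also removes sentence punctuation and the slash, so a rating written as 10/10 ends up as 1010 in the cleaned text.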

In [12]:
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning)
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)


The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.





The review texts and their sentiment labels are separated.

In [16]:
# Training data
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test data
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']

The full test set is then split evenly into a test set and a validation set.

In [17]:
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                    test_targets,
                                                    test_size=0.5,
                                                    stratify = test_targets)
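The effect of `stratify` can be illustrated with a small pure-Python sketch. This is not how scikit-learn implements it; `stratified_halves` is a hypothetical helper that only shows the idea: the indices are grouped by label, shuffled within each group, and each group is cut in half, so both halves keep the same class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_halves(labels, seed=0):
    """Split indices 50/50 so that each half keeps the label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    first_half, second_half = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        middle = len(indices) // 2
        first_half.extend(indices[:middle])
        second_half.extend(indices[middle:])
    return first_half, second_half


# Toy labels: six negative (0) and six positive (1) reviews
toy_labels = [0] * 6 + [1] * 6
val_idx, test_idx = stratified_halves(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(val_counts, test_counts)
```

The `train_test_split` call in the notebook does not set `random_state`, so the validation/test split differs between runs; passing a fixed `random_state` would make it reproducible.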

### 3. Tokenization and Encoding
Tokenization and encoding are done with the BERT tokenizer.

Tokenization splits the text in the dataset into smaller units, called tokens, so that the text is represented in a way that is meaningful to a machine without losing its context.

Encoding converts the tokens into a numeric representation (integer IDs, which the model then maps to vectors), so that the model can capture the context and the relationships between words and sentences and recognize patterns in any text.
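BERT's tokenizer uses WordPiece: a word that is not in the vocabulary as a whole is split greedily into the longest matching sub-word pieces, and continuation pieces are prefixed with `##`. The sketch below illustrates that idea under stated assumptions: `TOY_VOCAB` and the `wordpiece` helper are made up for this example, and the real vocabulary of `bert-base-uncased` is far larger and will split words differently.

```python
# Hypothetical miniature vocabulary; the real one has roughly 30,000 entries
TOY_VOCAB = {"un", "##watch", "##able", "film", "##s", "great"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


tokens = {w: wordpiece(w, TOY_VOCAB) for w in ["great", "films", "unwatchable", "xyz"]}
for word, pieces in tokens.items():
    print(word, "->", pieces)
```

Splitting into sub-word pieces lets the model handle rare or misspelled words without needing a vocabulary entry for every possible word.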

In [18]:
#Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


BERT tokenization is applied to the training, validation, and test sets.

In [19]:
max_len= 128
# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

In [20]:
k = 0
print('Training Comments -->>',Reviews[k])
print('\nInput Ids -->>\n',X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n',tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n',X_train_encoded['attention_mask'][k])
print('\nLabels -->>',Target[k])

Training Comments -->> Timeless musical gem, with Gene Kelly in top form, stylish direction by Vincente Minnelli, and wonderful musical numbers It is great entertainment from start to finish, one of those films that people watch with a smile and say they don't make 'em like they used to But they never did quite make them like this The climactic 25 minute musical sequence without any dialogue is among the most beautiful in film history Movie magic, clearly derived from the heart and soul of everyone involved A must see

Input Ids -->>
 tf.Tensor(
[  101 27768  3315 17070  1010  2007  4962  5163  1999  2327  2433  1010
  2358  8516  4509  3257  2011  6320  2063  8117  9091  2072  1010  1998
  6919  3315  3616  2009  2003  2307  4024  2013  2707  2000  3926  1010
  2028  1997  2216  3152  2008  2111  3422  2007  1037  2868  1998  2360
  2027  2123  1005  1056  2191  1005  7861  2066  2027  2109  2000  2021
  2027  2196  2106  3243  2191  2068  2066  2023  1996 18856  9581 13306
  2423  33

The BERT tokenizer uses special tokens such as [CLS], [SEP], and [MASK]. When a review is encoded, [CLS] is added at the start of the sequence and [SEP] at the end; [MASK] is not inserted when encoding text for classification. These tokens mean:
1. [CLS] is used for classification and represents the whole input in sentiment analysis.
2. [SEP] is a separator that marks the boundary between sentences or segments.
3. [MASK] is used for masking, that is, hiding some tokens from the model during pre-training.

The BERT tokenizer returns the following outputs:

1. input_ids: the numeric IDs of the tokens in the vocabulary.

2. token_type_ids: identifies which segment or sentence each token comes from.

3. attention_mask: a flag that tells the model which tokens to attend to and which to ignore (the padding).
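How the special tokens and the three outputs fit together can be sketched in plain Python. The special-token IDs below (0 for [PAD], 100 for [UNK], 101 for [CLS], 102 for [SEP]) are the ones used by `bert-base-uncased`; the word IDs in `TOY_VOCAB` and the `encode_batch` helper are hypothetical and only serve the illustration.

```python
PAD, UNK, CLS, SEP = 0, 100, 101, 102  # special-token IDs in bert-base-uncased
TOY_VOCAB = {"great": 1001, "film": 1002, "boring": 1003}  # made-up word IDs


def encode_batch(token_lists, max_length):
    """Add [CLS]/[SEP], truncate, and pad every row to the longest row."""
    rows = []
    for tokens in token_lists:
        tokens = tokens[: max_length - 2]  # leave room for [CLS] and [SEP]
        rows.append([CLS] + [TOY_VOCAB.get(t, UNK) for t in tokens] + [SEP])
    longest = max(len(row) for row in rows)
    return {
        "input_ids": [row + [PAD] * (longest - len(row)) for row in rows],
        # Single-sentence input: every token belongs to segment 0
        "token_type_ids": [[0] * longest for _ in rows],
        # 1 for real tokens, 0 for padding
        "attention_mask": [[1] * len(row) + [0] * (longest - len(row)) for row in rows],
    }


batch = encode_batch([["great", "film"], ["boring"]], max_length=8)
for name, values in batch.items():
    print(name, values)
```

The shorter review is padded with [PAD] so that both rows have the same length, and its attention mask is 0 at the padded position so the model ignores it.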

### 4. Building the Model
The model used is a pre-trained BERT.

In [21]:
# Initialize model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The model is then compiled.

In [22]:
# Compile the model with optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
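The loss chosen above works on integer labels ("sparse") and on raw logits (`from_logits=True`), so it applies the softmax itself. For one example, the value is the negative log of the softmax probability assigned to the true class. A minimal pure-Python sketch of that calculation, with made-up logits and a hypothetical helper name `sparse_ce_from_logits`:

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # log-sum-exp with the maximum subtracted for numerical stability
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(z - highest) for z in logits))
    return log_sum_exp - logits[label]


undecided = sparse_ce_from_logits([0.0, 0.0], label=1)       # model has no preference
confident_right = sparse_ce_from_logits([-2.0, 3.0], label=1)  # strongly favors the true class
confident_wrong = sparse_ce_from_logits([-2.0, 3.0], label=0)  # strongly favors the wrong class
print(undecided, confident_right, confident_wrong)
```

The loss is small when the logit of the true class dominates and grows quickly when the model is confidently wrong; Keras averages these per-example values over each batch.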

Train the model.

In [23]:
# Model training
history = model.fit(
    [X_train_encoded['input_ids'], X_train_encoded['token_type_ids'], X_train_encoded['attention_mask']],
    Target,
    validation_data=(
      [X_val_encoded['input_ids'], X_val_encoded['token_type_ids'], X_val_encoded['attention_mask']],y_val),
    batch_size=32,
    epochs=3
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


### 5. Model Evaluation
The model that has been built and trained is evaluated on its accuracy.

In [24]:
#Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
    y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')

Test loss: 0.3450615108013153, Test accuracy: 0.8844799995422363


The trained model and tokenizer are saved as a checkpoint.

In [25]:
path = 'saved'

# Save tokenizer and model
tokenizer.save_pretrained(path +'/Tokenizer')
model.save_pretrained(path +'/Model')

The saved model and tokenizer can be loaded directly for use.

In [36]:
# Load tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')

Some layers from the model checkpoint at saved/Model were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at saved/Model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [37]:
pred = bert_model.predict(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']])

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {
    1: 'positive',
    0: 'Negative'
}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label    :', Actual[:10])

Predicted Label : ['Negative', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'positive', 'Negative', 'Negative']
Actual Label    : ['Negative', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'positive', 'positive', 'Negative']
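The step from logits to label strings can be followed by hand on a couple of made-up logit rows. The sketch below is an illustration only: the logit values are hypothetical, and the `label` dictionary is repeated from the notebook. It also shows the softmax probabilities, which the notebook does not compute, to make clear that taking the argmax of the logits gives the same class as taking the argmax of the probabilities.

```python
import math

label = {1: 'positive', 0: 'Negative'}  # same mapping as in the notebook


def softmax(logits):
    """Turn a row of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, one row per review: [score for class 0, score for class 1]
toy_logits = [[2.1, -1.3], [-0.4, 1.7]]

results = []
for row in toy_logits:
    probs = softmax(row)
    predicted_class = max(range(len(row)), key=row.__getitem__)  # argmax of the logits
    results.append((label[predicted_class], probs))
    print(label[predicted_class], [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, the notebook can apply `tf.argmax` to the logits directly without converting them to probabilities first.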


In [38]:
print("Result: \n", classification_report(Actual, pred_labels))

Result: 
               precision    recall  f1-score   support

    Negative       0.87      0.90      0.89      6250
    positive       0.90      0.87      0.88      6250

    accuracy                           0.88     12500
   macro avg       0.88      0.88      0.88     12500
weighted avg       0.88      0.88      0.88     12500



1. Precision measures how many of the cases predicted as positive are actually positive: "Of all the instances the model predicted as positive, how many are truly positive?" The formula is $\frac{\text{True Positives}}{\text{True Positives + False Positives}}$

2. Recall measures how many of the actual positive cases are predicted correctly: "Of all the actual positive instances, how many did the model predict as positive?" The formula is $\frac{\text{True Positives}}{\text{True Positives + False Negatives}}$

3. F1 score is the harmonic mean of precision and recall. It ranges from 0 to 1, where 1 is the best F1 score and indicates perfect precision and recall. The formula is $2 \times\frac{\text{Precision } \times \text{ Recall}}{\text{Precision } + \text{ Recall}}$

4. Macro average computes the metric independently for each class and then takes the unweighted mean over all classes. For example, macro-average precision computes the precision of each class and then averages those values.

5. Weighted average is similar to macro average but takes the number of instances in each class into account: each class is weighted by its number of samples (its support). This is especially useful for imbalanced datasets, where some classes have many more instances than others.
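These definitions can be checked by hand on the first ten predicted and actual labels printed earlier in the notebook. The sketch below is a plain-Python illustration with a hypothetical helper `per_class_metrics`; it is not how `classification_report` is implemented, and because it uses only ten samples its numbers differ from the full report.

```python
# First ten predictions and true labels, copied from the notebook output
predicted = ['Negative', 'Negative', 'Negative', 'Negative', 'positive',
             'Negative', 'positive', 'positive', 'Negative', 'Negative']
actual = ['Negative', 'Negative', 'Negative', 'Negative', 'positive',
          'Negative', 'positive', 'positive', 'positive', 'Negative']


def per_class_metrics(actual, predicted, target):
    """Precision, recall, F1, and support, treating `target` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)
    fp = sum(a != target and p == target for a, p in pairs)
    fn = sum(a == target and p != target for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of actual instances


metrics = {c: per_class_metrics(actual, predicted, c) for c in ('Negative', 'positive')}
total = sum(m[3] for m in metrics.values())
macro_precision = sum(m[0] for m in metrics.values()) / len(metrics)
weighted_precision = sum(m[0] * m[3] for m in metrics.values()) / total

for name, (precision, recall, f1, support) in metrics.items():
    print(f'{name:>9}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}  support={support}')
print(f'macro precision={macro_precision:.3f}  weighted precision={weighted_precision:.3f}')
```

In this sample the two averages differ because the classes are not balanced (six negative, four positive). In the full test split both classes have 6,250 samples, so macro and weighted averages coincide.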

### 6. Prediction from User Input

In [47]:
def get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Wrap a single review in a list so the tokenizer always receives a batch
    if not isinstance(Review, list):
        Review = [Review]

    # Tokenize and encode with the same settings used for the training data
    encoded = Tokenizer.batch_encode_plus(Review,
                                          padding=True,
                                          truncation=True,
                                          max_length=128,
                                          return_tensors='tf')
    # Access the encoded fields by name instead of relying on the order of .values()
    prediction = Model.predict(
        [encoded['input_ids'], encoded['token_type_ids'], encoded['attention_mask']])

    # The index of the largest logit is the predicted class
    pred_labels = tf.argmax(prediction.logits, axis=1)

    # Convert the tensor to a list and map each class index to its label string
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels

In [40]:
review_1 ='''This is the best written show I have watched with focus on every single character. The writers have done exceptional job in writing the story and at times goosebumps inducing scenarios which relate to one another in complete unexpected way.
One more best part in the story was that the seriousness and dark mode never lasted long and when it did, it faded away with the best climax I have watched. The ending of the story is beautifully and leisurely written which just warms your heart.
Each and every actor and their characters are awesome as they are done justice on till the end.
And how could I not appreciate the lead female, AJI 3(robot), or Jo Ji A, is beautiful both as human and as the Robot from the start.
It's a heart warming series from start till the end which will make you laugh and cry.'''
get_sentiment(review_1)



['positive']

In [41]:
review_2 = '''Another superb offering!!! From watching mainly English period dramas most of my life, in mid life I've stumbled onto k dramas thanks to a recommendation from, of all people, Welsh &Irish friends, which I suppose goes to show their well deserved global out reach!! i love how gentle and relaxing they are.. This story was something very out of the ordinary.. Also the actors weren't impossibly good looking, the locations, architecture,  costumes things I'm normally so attracted to, were missing: most of the action takes place at night...yet I was so drawn to it & found it very endearing..
I watch with subtitles and the Korean audio and find myself turning up the volume and realising I can't actually understand most of what's been said.. But they really do speak to our hearts. In a world of Western dramas which have lost their way and are now too full of potty mouthed unsavoury characters, too much sex, drugs & drink, no wonder I find myself watching k dramas whenever the TV is turned on..Thank you Netflix for exposing us all to different cultures and providing such high quality entertainment!'''

get_sentiment(review_2)



['positive']

In [42]:
review_3 = '''This movie is not good at all. It has no suspense, and the story is so boring'''
get_sentiment(review_3)



['Negative']

In [45]:
review_4 = '''Half of the story through the end became boring🙄.
I can’t find the climax on the story that should keep the viewers wanting to watch the next episode.
I’m disappointed ☹️'''
get_sentiment(review_4)



['Negative']

In [46]:
review_5 = '''The male lead was toxic as hell the female lead was excellent
there is no balance all the fight in thier relationship always happen as the female lead
gets to endure the selfishness and huge ego of the male lead'''
get_sentiment(review_5)



['Negative']

### 7. Conclusion
The BERT model used here performs very well at sentiment analysis of movie reviews and can handle complex and varied linguistic expressions. On the held-out test split it reaches an accuracy of about 88%, with per-class precision and recall between 0.87 and 0.90.
