# Bangkit 2022 Capstone Project
This project classifies the sentiment of Indonesian-language text as either positive or negative by applying transfer learning to IndoBERT. The data was collected through a combination of semi-manual scraping, automated scraping, and open datasets available online.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
from transformers import BertTokenizer, TFBertModel

### Training data preprocessing

In [3]:
df_gmaps = pd.read_csv("./data/google_maps.csv")
df_tokped = pd.read_csv("./data/dataset_review_tokped_labelled.csv")

df_tokped.head()

Unnamed: 0,Review,Rating,Sentiment
0,enak kuacinya,5,positive
1,pengiriman cepat packing bagus sesuai pesanan ...,5,positive
2,pengemasan luar biasa baik untuk rasa menurut ...,4,negative
3,terimakasih min,5,neutral
4,udah order untuk kesekian kali jos,5,neutral


In [4]:
df_tokped.drop("Rating", axis="columns", inplace=True)
df_tokped.head()

Unnamed: 0,Review,Sentiment
0,enak kuacinya,positive
1,pengiriman cepat packing bagus sesuai pesanan ...,positive
2,pengemasan luar biasa baik untuk rasa menurut ...,negative
3,terimakasih min,neutral
4,udah order untuk kesekian kali jos,neutral


In [5]:
df_tokped.columns = ["text", "label"]
df_tokped.head()

Unnamed: 0,text,label
0,enak kuacinya,positive
1,pengiriman cepat packing bagus sesuai pesanan ...,positive
2,pengemasan luar biasa baik untuk rasa menurut ...,negative
3,terimakasih min,neutral
4,udah order untuk kesekian kali jos,neutral


In [6]:
# binarize the labels: "negative" becomes 0; "positive" and "neutral" both become 1
df_tokped["label"] = df_tokped["label"].map(lambda sentiment: 0 if sentiment == "negative" else 1)
df_tokped.head()

Unnamed: 0,text,label
0,enak kuacinya,1
1,pengiriman cepat packing bagus sesuai pesanan ...,1
2,pengemasan luar biasa baik untuk rasa menurut ...,0
3,terimakasih min,1
4,udah order untuk kesekian kali jos,1


In [7]:
df_tokped["label"].value_counts()

1    3488
0     572
Name: label, dtype: int64
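
The counts above show a clear class imbalance in the Tokopedia reviews. As a rough illustration that is not part of the original notebook, the class shares and the accuracy of a trivial majority-class baseline can be computed directly from those two numbers. The variable names below are hypothetical.

```python
# Class counts copied from the value_counts() output above
label_counts = {1: 3488, 0: 572}

total = sum(label_counts.values())
class_share = {label: count / total for label, count in label_counts.items()}

# A classifier that always predicts the most frequent label reaches this accuracy
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"total reviews: {total}")
for label, share in class_share.items():
    print(f"label {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any accuracy reported for the model on this data is easier to interpret next to this baseline.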

In [8]:
df = pd.concat([df_gmaps, df_tokped], ignore_index=True)

df.head()

Unnamed: 0,text,label
0,Tempat yang enak untuk hang out bersama teman ...,1
1,Tempatnya nyaman krn smoking areanya benar2 te...,1
2,Tempat ternyaman dan deket banget sama kantor....,1
3,"Tempatnya luas bgtt, nyaman kalo buat nugas ku...",1
4,Tempatnya cukup luas. Bisa blocking. Instagram...,1


### Train test split

In [9]:
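# shuffle the full dataset before splitting it into train/valid/test
# (no random_state is set, so the order differs on every run)
df = df.sample(frac=1, ignore_index=True)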

df.head()

Unnamed: 0,text,label
0,agak lama sih ngirimnya,0
1,kurang sedap kurang kerasa ikan nya,0
2,selesai dgn baik dan dpt bonus pie tq seller,1
3,pengiriman cepat sesuai pesanan recommended se...,1
4,ada bonus nya juga,1


In [10]:
# train-valid-test split 70-20-10
train_size = int(len(df) * 0.7)
valid_size = int(len(df) * 0.2)

df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]

print(len(df_train))
print(len(df_valid))
print(len(df_test))

2982
852
426
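
The three sizes printed above follow from integer truncation of the 70% and 20% fractions, with the remainder going to the test set. A minimal sketch of the same arithmetic as a reusable helper (the function name `split_sizes` is hypothetical and not part of the notebook):

```python
def split_sizes(n_rows, train_frac=0.7, valid_frac=0.2):
    """Return (train, valid, test) sizes; the test set takes whatever is left."""
    train_size = int(n_rows * train_frac)
    valid_size = int(n_rows * valid_frac)
    test_size = n_rows - train_size - valid_size
    return train_size, valid_size, test_size

# 4260 is the total number of rows implied by the three lengths printed above
print(split_sizes(4260))  # → (2982, 852, 426)
```

Because the test size is computed as a remainder, the three parts always add up to the full dataset, even when the fractions do not divide the row count evenly.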


In [11]:
x_train = df_train["text"].values
y_train = df_train["label"].values

x_valid = df_valid["text"].values
y_valid = df_valid["label"].values

x_test = df_test["text"].values
y_test = df_test["label"].values

### Modelling

In [13]:
# download the IndoBERT pre-trained model
model_name='cahya/bert-base-indonesian-522M'
bert_tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)
bert_model.trainable = False

Some layers from the model checkpoint at cahya/bert-base-indonesian-522M were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [14]:
# tokenize the training, validation, and test texts with the BERT tokenizer
# note: padding=True pads each split to its own longest sequence (capped at max_length)
x_train_tokenized = bert_tokenizer(x_train.tolist(), truncation=True, max_length=100, padding=True, return_tensors="tf")
x_valid_tokenized = bert_tokenizer(x_valid.tolist(), truncation=True, max_length=100, padding=True, return_tensors="tf")
x_test_tokenized = bert_tokenizer(x_test.tolist(), truncation=True, max_length=100, padding=True, return_tensors="tf")
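
To make the effect of `truncation=True`, `max_length`, and `padding` concrete, here is a toy sketch in plain Python. It does not call the real tokenizer: the helper `pad_or_truncate`, the token ids, and the pad id `0` are all illustrative assumptions. It only mimics how a batch of id sequences is cut to a maximum length and padded, and how the matching attention mask marks real tokens with 1 and padding with 0.

```python
def pad_or_truncate(batch, max_length, pad_to_max=False, pad_id=0):
    """Toy version of tokenizer truncation and padding on lists of token ids."""
    # Truncation: cut every sequence to at most max_length ids
    # (the real tokenizer also preserves special tokens; this toy does not)
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch;
    # padding="max_length" pads to max_length regardless of the batch
    target = max_length if pad_to_max else max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Made-up token ids for three "reviews" of different lengths
batch = [[2, 15, 3], [2, 7, 8, 9, 10, 11, 3], [2, 3]]

ids_longest, mask_longest = pad_or_truncate(batch, max_length=5)
ids_fixed, _ = pad_or_truncate(batch, max_length=8, pad_to_max=True)

print(ids_longest)        # every row has the length of the longest truncated row
print(mask_longest)       # 1 for real tokens, 0 for padding
print(len(ids_fixed[0]))  # every row has exactly max_length ids
```

With the first mode, two splits can end up with different sequence lengths if their longest reviews differ, which matters for a model that expects a fixed input shape.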

### Evaluation

In [16]:
# Load the TFLite model
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Get the input and output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare the input data: only input_ids are fed to the model,
# cast to the dtype the interpreter expects
x_test_tokenized_input = x_test_tokenized.input_ids.numpy().astype(input_details[0]['dtype'])

# Run inference one review at a time (batch size 1)
predictions = []
for input_ids in x_test_tokenized_input:
    input_data = input_ids.reshape(1, -1)
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    predictions.append(output_data[0][0])

# Convert the predictions to binary labels
predictions = np.array(predictions)
binary_predictions = np.round(predictions).astype(int)

# Print the predictions
for i in range(len(x_test)):
    print(f"Review: {x_test[i]}")
    print(f"Predicted Label: {binary_predictions[i]}")
    print()


Review: second order
Predicted Label: 1

Review: barangnya puas
Predicted Label: 1

Review: pertama kali beli disini respon dan pengiriman cepat packing pun rapi dengan dibungkus double dan aman terima kasih
Predicted Label: 1

Review: tlong di beri rasa yg beda2 bbrp kali order dpt nya rasa itu2 aja
Predicted Label: 0

Review: puas bgt belanja jus disini pengiriman rapi dan cepat seller sgt ramah dan selalu membalasa chat dgn cepat hrg pun murah pokoknya pelayanannya bagus rekomen bgt beli disini
Predicted Label: 1

Review: Tempat sdh bagus sayang di sayang pelayanan nya gak bagus,👎👎👎👎, Gak bisa order saat pergantian shift, lama sekali kebanyakan bercanda, terlalu cuek terhadap customer
Predicted Label: 1

Review: bagus semoga kwalitas tetap di utamakan biar jadi langganan tetap
Predicted Label: 1

Review: semua item sesuai orderan seperti biasa selalu prosesnya cepat packing dan kirim rapi untuk rasa dan kualitas rotinya keluarga kita cocok chewy tp lembut teksturnya challa selalu ma