## Import Libraries

Gradio is used to deploy the model as a simple web interface. PySastrawi (Python Sastrawi) is a small library that reduces Indonesian words with affixes to their base form.

In [1]:
!pip -q install gradio  
!pip -q install PySastrawi


NLTK is used for tokenization.

In [2]:
#Standard Library
import string
import re

#Third-party Library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import gradio as gr

nltk.download('punkt')
pd.set_option("display.max_colwidth", 1000)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Data Collections

The data consists of Indonesian-language tweets about cellular service providers, each labeled with a positive or negative sentiment.

In [3]:
!git clone https://github.com/rakkaalhazimi/Data-NLP-Bahasa-Indonesia.git

Cloning into 'Data-NLP-Bahasa-Indonesia'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 41 (delta 9), reused 21 (delta 4), pack-reused 0
Unpacking objects: 100% (41/41), done.


In [4]:
df = pd.read_csv("/content/Data-NLP-Bahasa-Indonesia/dataset_tweet_sentiment_cellular_service_provider.csv")
df.head()

Unnamed: 0,Id,Sentiment,Text Tweet
0,1,positive,<USER_MENTION> #BOIKOT_<PROVIDER_NAME> Gunakan Produk Bangsa Sendiri <PROVIDER_NAME>
1,2,positive,"Saktinya balik lagi, alhamdulillah :v <PROVIDER_NAME>"
2,3,negative,Selamat pagi <PROVIDER_NAME> bisa bantu kenapa di dalam kamar sinyal 4G hilang yang 1 lagi panggilan darurat saja <URL>
3,4,negative,Dear <PROVIDER_NAME> akhir2 ini jaringan data lemot banget padahal H+ !!!!
4,5,negative,Selamat malam PENDUSTA <PROVIDER_NAME>


In [5]:
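#Check the placeholder tags (aliases) used in the tweets
#The lazy quantifier "*?" stops each match at the first ">" instead of the last one
#explode() puts every match on its own row so that value_counts() can count them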
df["Text Tweet"].str.findall("(<.*?>)").explode().value_counts()

<PROVIDER_NAME>    463
<URL>               49
<PRODUCT_NAME>      28
<USER_MENTION>      20
Name: Text Tweet, dtype: int64
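
The effect of the lazy quantifier is easiest to see side by side with the greedy form. The following is a minimal sketch using only the standard `re` module; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import re

# Sample text modeled on the dataset's placeholder tags (hypothetical sentence)
text = "<USER_MENTION> sinyal <PROVIDER_NAME> hilang <URL>"

# Greedy: ".*" runs to the LAST ">" in the string, giving one long match
greedy = re.findall(r"(<.*>)", text)

# Lazy: ".*?" stops at the FIRST ">", giving one match per tag
lazy = re.findall(r"(<.*?>)", text)

print(greedy)
print(lazy)
```

With the greedy pattern the whole sentence would be counted as a single "tag", which is why the lazy form is the one used for the frequency check.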

In [6]:
#Check emoticons
#\S matches any character that is not whitespace (\s)
df["Text Tweet"].str.findall(r"(:\S+)").explode().value_counts()

:v     4
:.     1
:D     1
:))    1
Name: Text Tweet, dtype: int64

In [7]:
#Check words that are followed by an exclamation point
df["Text Tweet"].str.findall(r"\w+!").explode().value_counts()

buang!           2
mampus!          1
dibaca!          1
ini!             1
susah!           1
nya!             1
Mantap!          1
jancok!          1
Tipu!            1
gitu!            1
ditingkatkan!    1
sebelah!         1
GRATIS!          1
direspon!        1
jelek!           1
lho!             1
mahal!           1
Kenapa!          1
App!             1
BANGET!          1
kota!            1
yess!            1
Name: Text Tweet, dtype: int64

## Text Preprocessing

### Text Cleaning

1. Replace selected punctuation marks with a space
2. Collapse runs of whitespace into a single space
3. Convert all letters to lowercase (case folding)
4. Separate exclamation points from the preceding word

In [8]:
#string.punctuation lists the ASCII punctuation characters
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
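#Take the marks we want to keep (! < _ > # : .) out of the list of characters to strip
#\. is a literal period
#Note: inside the [...] class built in punct2wspace, the sequence ",-/" is read as a range
#that also covers ".", so periods are still replaced by a space (see the example output)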
punctuations = re.sub(r"[!<_>#:\.]","", string.punctuation)

def punct2wspace(text):
  return re.sub(r"[{}]+".format(punctuations)," ", text)

def normalize_wspace(text):
  return re.sub(r"\s+", " ", text)

def casefolding(text):
  return text.lower()

#(\w+)(!) defines two capture groups; \1 and \2 refer to them in the replacement
def separate_punct(text):
  return re.sub(r"(\w+)(!)", r"\1 \2", text)

In [10]:
separate_punct("nabilah!")

'nabilah !'
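
When a set of characters is interpolated into a regex character class, characters such as `-`, `\` and `]` keep their special meaning. The sketch below compares raw interpolation with an escaped variant built with `re.escape`. It assumes the intent is to keep periods in the text; `strip_raw` and `strip_escaped` are hypothetical names used only for this comparison.

```python
import re
import string

# Same idea as above: take the marks we want to keep out of the strip list
keep = "!<_>#:."
to_strip = "".join(ch for ch in string.punctuation if ch not in keep)

# Raw interpolation: ",-/" inside [...] is read as a range that includes "."
raw_class = re.compile(r"[{}]+".format(to_strip))

# Escaped interpolation: every character is taken literally
escaped_class = re.compile("[{}]+".format(re.escape(to_strip)))

def strip_raw(text):
    return raw_class.sub(" ", text)

def strip_escaped(text):
    return escaped_class.sub(" ", text)

print(strip_raw("user@gmail.com"))      # the period is replaced
print(strip_escaped("user@gmail.com"))  # the period is kept
```

Switching to the escaped variant would change the cleaned text, so any model trained on it would need to be refit.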

### Stemming

Stemming is the removal of affixes from words. For Indonesian, the Sastrawi library is used.

In [11]:
#Create Stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

#Stemming Process Example
sentence = "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan"
output = stemmer.stem(sentence)

print(output)

print(stemmer.stem("Mereka meniru-nirukannya"))

ekonomi indonesia sedang dalam tumbuh yang bangga
mereka tiru


### Preprocessing

In [12]:
def preprocess_text(text):
  text = punct2wspace(text)
  text = normalize_wspace(text)
  text = casefolding(text)
  text = separate_punct(text)
  #The stemmer is not applied because it would remove the exclamation marks, which are needed in this task
  #text = stemmer.stem(text)
  return text

In [13]:
#Example
preprocess_text("hannaninabilah@gmail.com!")

'hannaninabilah gmail com !'

In [14]:
#Apply the preprocessing to every tweet and preview the result
df["cleaned_text"]=df["Text Tweet"].apply(preprocess_text)
df.head()

Unnamed: 0,Id,Sentiment,Text Tweet,cleaned_text
0,1,positive,<USER_MENTION> #BOIKOT_<PROVIDER_NAME> Gunakan Produk Bangsa Sendiri <PROVIDER_NAME>,<user_mention> #boikot_<provider_name> gunakan produk bangsa sendiri <provider_name>
1,2,positive,"Saktinya balik lagi, alhamdulillah :v <PROVIDER_NAME>",saktinya balik lagi alhamdulillah :v <provider_name>
2,3,negative,Selamat pagi <PROVIDER_NAME> bisa bantu kenapa di dalam kamar sinyal 4G hilang yang 1 lagi panggilan darurat saja <URL>,selamat pagi <provider_name> bisa bantu kenapa di dalam kamar sinyal 4g hilang yang 1 lagi panggilan darurat saja <url>
3,4,negative,Dear <PROVIDER_NAME> akhir2 ini jaringan data lemot banget padahal H+ !!!!,dear <provider_name> akhir2 ini jaringan data lemot banget padahal h !!!!
4,5,negative,Selamat malam PENDUSTA <PROVIDER_NAME>,selamat malam pendusta <provider_name>


## Feature Extractions


### Convert Text to Vector

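A count vector records, for each document, how many times each vocabulary term occurs. With ngram_range=(1,2) the vocabulary contains both single words and pairs of adjacent words. Word order beyond those pairs is not taken into account, so sentences with different meanings can end up with similar vectors.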

In [15]:
count_vect = CountVectorizer(max_features=10_000, ngram_range=(1,2))
count_repr = count_vect.fit_transform(df["cleaned_text"])
count_repr

<300x3466 sparse matrix of type '<class 'numpy.int64'>'
	with 6228 stored elements in Compressed Sparse Row format>

In [16]:
count_vect.vocabulary_

{'user_mention': 3332,
 'boikot_': 497,
 'provider_name': 2396,
 'gunakan': 1022,
 'produk': 2379,
 'bangsa': 317,
 'sendiri': 2860,
 'user_mention boikot_': 3335,
 'boikot_ provider_name': 498,
 'provider_name gunakan': 2460,
 'gunakan produk': 1023,
 'produk bangsa': 2380,
 'bangsa sendiri': 318,
 'sendiri provider_name': 2861,
 'saktinya': 2683,
 'balik': 299,
 'lagi': 1621,
 'alhamdulillah': 184,
 'saktinya balik': 2684,
 'balik lagi': 300,
 'lagi alhamdulillah': 1622,
 'alhamdulillah provider_name': 191,
 'selamat': 2826,
 'pagi': 2181,
 'bisa': 467,
 'bantu': 327,
 'kenapa': 1475,
 'di': 756,
 'dalam': 639,
 'kamar': 1360,
 'sinyal': 2942,
 '4g': 65,
 'hilang': 1098,
 'yang': 3398,
 'panggilan': 2231,
 'darurat': 704,
 'saja': 2674,
 'url': 3320,
 'selamat pagi': 2829,
 'pagi provider_name': 2184,
 'provider_name bisa': 2428,
 'bisa bantu': 469,
 'bantu kenapa': 328,
 'kenapa di': 1476,
 'di dalam': 764,
 'dalam kamar': 642,
 'kamar sinyal': 1361,
 'sinyal 4g': 2943,
 '4g hilang'

In [17]:
count_repr.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
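
To make the count representation concrete, the sketch below rebuilds it by hand for two short texts. It uses the default token pattern documented for `CountVectorizer` (`(?u)\b\w\w+\b`, tokens of two or more word characters), which is why single characters and marks such as `!` or `#` do not show up in the vocabulary above. The helpers `ngrams` and `count_vector` are hypothetical and are not scikit-learn's implementation.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def ngrams(text, n_min=1, n_max=2):
    """Return the word n-grams of `text` for n in [n_min, n_max]."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = [
    "dear <provider_name> jaringan data lemot banget !!!!",
    "selamat malam pendusta <provider_name>",
]

# Vocabulary: every distinct n-gram, indexed in alphabetical order
vocabulary = {g: i for i, g in enumerate(sorted({g for d in docs for g in ngrams(d)}))}

def count_vector(text):
    """Count how often each vocabulary n-gram occurs in `text`."""
    counts = Counter(ngrams(text))
    return [counts.get(g, 0) for g in vocabulary]

print(ngrams(docs[1]))
print(count_vector(docs[1]))
```

Each position in the printed vector corresponds to one vocabulary entry, and most positions are zero, which is why the real matrix is stored in sparse form.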

### TF-IDF

TF-IDF is similar to a count vector, but each term is weighted by two factors: how often it occurs in a document (term frequency) and how many documents in the corpus contain it (inverse document frequency). A term that appears in many documents receives a lower weight, because it does little to distinguish one document from another.

In [18]:
tfidf_vect = TfidfVectorizer(max_features=10_000)
tfidf_repr = tfidf_vect.fit_transform(df["cleaned_text"])
tfidf_repr

<300x1020 sparse matrix of type '<class 'numpy.float64'>'
	with 3145 stored elements in Compressed Sparse Row format>

In [19]:
tfidf_repr.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
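
The weighting can be worked through by hand on a tiny corpus. This sketch follows the smoothed formula documented as the default for `TfidfVectorizer`, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization; the three tokenized documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents
docs = [
    ["sinyal", "lemot", "banget"],
    ["sinyal", "bagus"],
    ["sinyal", "hilang", "lagi"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

Because "sinyal" occurs in every document, it ends up with a lower weight than "bagus", which occurs in only one.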

## Model Building

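Two classifiers are compared: logistic regression and multinomial naive Bayes. In MultinomialNB, "multinomial" refers to the distribution assumed for the word counts rather than to the number of labels; the task here has two labels, negative and positive.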

In [20]:
logres = LogisticRegression()
multi_nb = MultinomialNB()
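
As a rough illustration of what a multinomial naive Bayes classifier computes from word counts, here is a minimal pure-Python sketch with additive (Laplace) smoothing. The toy training set and all function names are assumptions made for this example; it is not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training set: (tokens, label) with 0 = negative, 1 = positive
train = [
    (["sinyal", "lemot"], 0),
    (["jaringan", "lemot", "banget"], 0),
    (["sinyal", "bagus"], 1),
    (["mantap", "bagus", "banget"], 1),
]

vocab = sorted({t for tokens, _ in train for t in tokens})
labels = sorted({y for _, y in train})

# Per-class word counts and number of documents per class
word_counts = {y: Counter() for y in labels}
class_counts = Counter()
for tokens, y in train:
    word_counts[y].update(tokens)
    class_counts[y] += 1

def log_likelihood(token, y, alpha=1.0):
    """log P(token | class y) with additive (Laplace) smoothing."""
    total = sum(word_counts[y].values())
    return math.log((word_counts[y][token] + alpha) / (total + alpha * len(vocab)))

def predict(tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for y in labels:
        score = math.log(class_counts[y] / len(train))
        # Tokens outside the vocabulary are ignored, as a fitted vectorizer would do
        score += sum(log_likelihood(t, y) for t in tokens if t in vocab)
        scores[y] = score
    return max(scores, key=scores.get)

print(predict(["sinyal", "lemot", "banget"]))
print(predict(["bagus"]))
```

The smoothing term keeps a word that was never seen with a class from driving that class's probability to zero.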

## Train and Test Model

In [21]:
target = df["Sentiment"].map({"negative":0, "positive":1})
features = df["cleaned_text"]

### Split Data

In [22]:
train_x, test_x, train_y, test_y = train_test_split(features,target, test_size=0.2, random_state=42)

### Pipeline

In [23]:
#Pipeline Logistic
logres_pipe = Pipeline(
    [
     ("feature_extractions", tfidf_vect),
     ("classifier", logres)
    ]
)

#Pipeline NB
#A count vector is used because MultinomialNB is designed for non-negative count features
multinb_pipe = Pipeline(
    [
     ("feature_extractions", count_vect),
     ("classifier", multi_nb)
    ]
)

### Testing Model

In [24]:
multinb_pipe.fit(train_x, train_y)
multinb_pipe.score(test_x, test_y)

0.8666666666666667

In [25]:
logres_pipe.fit(train_x, train_y)
logres_pipe.score(test_x, test_y)

0.85

## Model Evaluation

In [26]:
#Multinomial Naive Bayes
multinb_report=classification_report(y_true=test_y, y_pred=multinb_pipe.predict(test_x))
print(multinb_report)

              precision    recall  f1-score   support

           0       0.88      0.88      0.88        33
           1       0.85      0.85      0.85        27

    accuracy                           0.87        60
   macro avg       0.87      0.87      0.87        60
weighted avg       0.87      0.87      0.87        60



In [27]:
#Logistic Regression
logres_report=classification_report(y_true=test_y, y_pred=logres_pipe.predict(test_x))
print(logres_report)

              precision    recall  f1-score   support

           0       0.80      0.97      0.88        33
           1       0.95      0.70      0.81        27

    accuracy                           0.85        60
   macro avg       0.88      0.84      0.84        60
weighted avg       0.87      0.85      0.85        60
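
The figures in both reports can be traced back to four confusion counts per model. The sketch below recomputes precision, recall, F1 and accuracy from such counts. The counts themselves are not printed by the notebook; they are inferred here from the support and recall columns (for example, a recall of 0.97 on 33 negative tweets corresponds to 32 correct), so treat them as an assumption.

```python
def report(tn, fp, fn, tp):
    """Precision, recall and F1 per class, plus accuracy, from confusion counts."""
    def prf(correct, predicted, actual):
        precision = correct / predicted
        recall = correct / actual
        f1 = 2 * precision * recall / (precision + recall)
        return round(precision, 2), round(recall, 2), round(f1, 2)

    negative = prf(tn, tn + fn, tn + fp)   # class 0
    positive = prf(tp, tp + fp, tp + fn)   # class 1
    accuracy = round((tn + tp) / (tn + fp + fn + tp), 2)
    return negative, positive, accuracy

# Counts inferred from the reports above (assumption, not notebook output)
print(report(tn=29, fp=4, fn=4, tp=23))   # Multinomial Naive Bayes
print(report(tn=32, fp=1, fn=8, tp=19))   # Logistic Regression
```

Read this way, logistic regression rarely mislabels a negative tweet but misses 8 of the 27 positive ones, while naive Bayes spreads its errors evenly across both classes.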



## Model Deployment

In [33]:
#Replace telkomsel|three|im3|smartfren|axis with <PROVIDER_NAME>
re.sub("(telkomsel|three|im3|smartfren|axis)", "<PROVIDER_NAME>", "saya suka telkomsel", flags=re.IGNORECASE) #re.IGNORECASE makes the match case-insensitive

'saya suka <PROVIDER_NAME>'

In [35]:
#Replace provider names with the placeholder used in the training data
def namespace_change(text):
  providers=["telkomsel","three","im3","smartfren","axis"]
  return re.sub(
      pattern="({})".format("|".join(providers)),
      repl="<PROVIDER_NAME>",
      string=text,
      flags=re.IGNORECASE)

#Replace @mentions with the placeholder used in the training data
def mention_change(text):
  return re.sub(
      pattern=r"@\S+",
      repl="<USER_MENTION>",
      string=text,
      flags=re.IGNORECASE)
  
print(namespace_change("nabilah menggunakan Telkomsel"))
print(mention_change("@03nabilah ayo kita pakai telkomsel"))

nabilah menggunakan <PROVIDER_NAME>
<USER_MENTION> ayo kita pakai telkomsel
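
A plain alternation also matches a provider name when it occurs inside a longer word. As an optional refinement, the sketch below adds `\b` word boundaries so that only whole words are replaced. The sample sentence is made up, and `loose` and `strict` are hypothetical names.

```python
import re

providers = ["telkomsel", "three", "im3", "smartfren", "axis"]

# Plain alternation: matches the names anywhere, even inside longer words
loose = re.compile("({})".format("|".join(providers)), flags=re.IGNORECASE)

# \b on both sides: matches the names only as whole words
strict = re.compile(r"\b({})\b".format("|".join(map(re.escape, providers))),
                    flags=re.IGNORECASE)

sample = "pindah dari Maxis ke AXIS"
print(loose.sub("<PROVIDER_NAME>", sample))
print(strict.sub("<PROVIDER_NAME>", sample))
```

Whether the stricter pattern is preferable depends on how the placeholders in the training data were produced, which the dataset does not document.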


In [43]:
#Map labels 0 and 1 to readable names
sentiment_map={0:"Negatif", 1:"Positif"}

#Main function called by the interface
def predict_sentiment(review):
  review_formatted = namespace_change(review)
  review_formatted = mention_change(review_formatted)
  review_cleaned = preprocess_text(review_formatted)

  #predict() returns an array with one label per input, so take the first element
  prediction = int(logres_pipe.predict([review_cleaned])[0])
  sentiment = sentiment_map.get(prediction)

  return sentiment

predict_sentiment("@admin kualitas sinyal telkomsel baik")


'Negatif'

In [44]:
#gr.Interface takes three keyword arguments: fn, inputs, and outputs
#Note: Gradio 3 and later expose this input component as gr.Textbox

iface = gr.Interface(
    fn=predict_sentiment,
    inputs=gr.inputs.Textbox(lines=2, placeholder="Review Anda tentang Provider ini"),
    outputs="text")
iface.launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://53859.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces (https://huggingface.co/spaces)


(<fastapi.applications.FastAPI at 0x7f1e56e6c990>,
 'http://127.0.0.1:7863/',
 'https://53859.gradio.app')