Preprocessing Data for Sentiment Analysis (Netizen)
Prepared by Fad

We first check whether a GPU is available. Running TensorFlow on a CUDA-enabled GPU (an RTX 3060 Ti in our case) lets the deep learning models run much faster than on the CPU alone.

In [None]:
import tensorflow as tf
from tensorflow.python.client import device_lib

tf.debugging.set_log_device_placement(True)
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 11704794323389237671
 xla_global_id: -1,
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 5748293632
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 13922877276532335009
 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3060 Ti, pci bus id: 0000:09:00.0, compute capability: 8.6"
 xla_global_id: 416903419]
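
The listing above comes from `device_lib`, which lives in TensorFlow's internal `tensorflow.python` namespace. As an alternative, here is a minimal sketch of the same check using the public `tf.config.list_physical_devices` call. The helper name `visible_gpus` is our own, and the fallback to an empty list when TensorFlow is not installed is an assumption made so the snippet runs anywhere.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        # Assumption for this sketch: treat a missing TensorFlow as "no GPU".
        return []
    # Public API (TensorFlow 2.1+); each entry has .name and .device_type.
    return [device.name for device in tf.config.list_physical_devices('GPU')]

print(visible_gpus())
```

An empty list means TensorFlow will fall back to the CPU.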

In [None]:
import pandas as pd

# read the csv file
df = pd.read_csv("Netizen_Raw.csv", low_memory=False)

In [None]:
# Keep only the columns of interest
df = df[['date', 'time', 'name', 'tweet', 'likes_count', 'retweets_count', 'mentions', 'hashtags', 'replies_count']]

In [None]:
df = df.dropna()

### Remove mentions and links

In [None]:
%%time
import re

# define a function to remove mentions and links
def remove_mentions_links(text):
    try:
        text = re.sub(r'@\w+', '', text) # remove mentions
        text = re.sub(r'http\S+', '', text) # remove links
        return text
    except TypeError:
        # non-string values (e.g. NaN) become an empty string
        return ""

# apply the function to the tweet text column
df["tweet_processed"] = df["tweet"].apply(remove_mentions_links)

CPU times: total: 1.05 s
Wall time: 1.05 s
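
Removing a mention or link leaves the surrounding spaces behind, so the cleaned text can start with a space or contain double spaces. The sketch below is a variant, not the function used in this notebook: it applies the same two patterns and then collapses whitespace. The name `clean_tweet` and the sample string are made up for illustration. Note that collapsing `\s+` also turns line breaks into single spaces, which may or may not be what you want.

```python
import re

MENTION_RE = re.compile(r'@\w+')    # same pattern as above
LINK_RE = re.compile(r'http\S+')    # same pattern as above
SPACE_RE = re.compile(r'\s+')       # extra step: runs of whitespace

def clean_tweet(text):
    """Strip mentions and links, then collapse leftover whitespace."""
    if not isinstance(text, str):
        return ""
    text = MENTION_RE.sub('', text)
    text = LINK_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()

# Hypothetical sample, not taken from the dataset
sample = "@user1 Rindu kerajaan @user2 https://example.com barisan nasional"
print(clean_tweet(sample))
```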


In [None]:
df.head()

Unnamed: 0,date,time,name,tweet,likes_count,retweets_count,mentions,hashtags,replies_count,tweet_processed
0,2022-11-29,23:39:57,WaWan,@fahmi_fadzil Kemon YB jgn jd sengal.. dulu ka...,0,0,['fahmi_fadzil'],['kerajaanpintubelakang'],0,Kemon YB jgn jd sengal.. dulu kala laungan #k...
1,2022-11-29,22:33:07,M🔺l🔺y🔺 Tanahairku 🇲🇾,I hope it’s not si ponorogo @presidenumnomy #z...,1,0,['presidenumnomy'],"['zahidhamidiletakjawatan', 'umno', 'barisanna...",0,I hope it’s not si ponorogo #zahidhamidiletak...
2,2022-11-29,20:50:34,RameshRaoAKS******,Kencing Dan puak Ularmark @DPPMalaysia @titm_o...,20,8,"['dppmalaysia', 'titm_official', 'abdulhadiawa...","['pakatanharapan', 'barisannasional', 'anwarib...",4,Kencing Dan puak Ularmark berpisah Tiada!!\...
3,2022-11-29,19:28:40,RameshRaoAKS******,"Jika Kamu Sering Dimainkan,\nJika Kamu Sering ...",8,3,"['najibrazak', 'drzahidhamidi']","['anwaribrahim', 'anwaribrahimpm10', 'bossku',...",2,"Jika Kamu Sering Dimainkan,\nJika Kamu Sering ..."
4,2022-11-29,18:59:19,New Straits Times,#NSTTV Working with Pakatan Harapan (#PH) and ...,2,2,[],"['nsttv', 'ph', 'barisannasional', 'umno']",1,#NSTTV Working with Pakatan Harapan (#PH) and ...


### Load our Malaya sentiment model

In [None]:
import malaya

# view available transformer
malaya.sentiment.available_transformer()



Unnamed: 0,Size (MB),Quantized Size (MB),macro precision,macro recall,macro f1-score
bert,425.6,111.0,0.93182,0.93442,0.93307
tiny-bert,57.4,15.4,0.9339,0.93141,0.93262
albert,48.6,12.8,0.91228,0.91929,0.9154
tiny-albert,22.4,5.98,0.91442,0.91646,0.91521
xlnet,446.6,118.0,0.9239,0.92629,0.92444
alxlnet,46.8,13.3,0.91896,0.92589,0.92198


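The table reports macro precision, recall, and F1 rather than accuracy. BERT has the highest macro F1 (0.93307), but Tiny-BERT is within 0.0005 of it (0.93262) at 57.4 MB instead of 425.6 MB, so we load Tiny-BERT here. We also use the quantized variant (15.4 MB), which shrinks the model further at the cost of some accuracy.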
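
To make the size versus quality trade-off concrete, the short calculation below compares the BERT and Tiny-BERT rows. The numbers are copied from the table above; the variable names are our own.

```python
# Figures copied from the table above: (size in MB, quantized size in MB, macro F1).
models = {
    "bert": (425.6, 111.0, 0.93307),
    "tiny-bert": (57.4, 15.4, 0.93262),
}

bert_size, bert_qsize, bert_f1 = models["bert"]
tiny_size, tiny_qsize, tiny_f1 = models["tiny-bert"]

f1_gap = bert_f1 - tiny_f1           # macro F1 given up by choosing Tiny-BERT
size_ratio = bert_size / tiny_size   # how many times larger full BERT is

print(f"macro F1 gap: {f1_gap:.5f}")
print(f"size ratio:   {size_ratio:.1f}x")
```

Full BERT is roughly seven times larger for a macro F1 difference in the fourth decimal place.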

In [None]:
model = malaya.sentiment.transformer(model='tiny-bert', quantized=True)

Load quantized model will cause accuracy drop.
Downloading: 100%|██████████| 15.1M/15.1M [00:03<00:00, 4.22MB/s]


In [None]:
# Create function for prediction
def predict_tweet(string):
    return model.predict([string])[0]

In [None]:
%%time

# Apply the function to the 'tweet_processed' column and store the results in a new column 'prediction'
df['prediction'] = df['tweet_processed'].apply(predict_tweet)

CPU times: total: 6h 43min 50s
Wall time: 1h 33min 59s
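
Most of that time goes into calling the model once per row. Since `model.predict` already takes a list of strings, one possible speed-up is to pass it several tweets at a time. The helper below is a sketch of that idea, not something we ran on this dataset: `predict_in_batches` and the batch size of 32 are our own choices, and `fake_predict` is a stand-in so the snippet runs without Malaya.

```python
def predict_in_batches(texts, predict_fn, batch_size=32):
    """Call predict_fn on consecutive slices of texts and concatenate the labels."""
    texts = list(texts)
    labels = []
    for start in range(0, len(texts), batch_size):
        # predict_fn receives a list of strings and returns one label per string
        labels.extend(predict_fn(texts[start:start + batch_size]))
    return labels

# Stand-in predictor for illustration only; in the notebook this would be model.predict
def fake_predict(batch):
    return ["negative" for _ in batch]

labels = predict_in_batches(["a", "b", "c"], fake_predict, batch_size=2)
print(labels)
```

In the notebook this would be used as `df['prediction'] = predict_in_batches(df['tweet_processed'].tolist(), model.predict)`. The best batch size depends on GPU memory and tweet length, so it is worth timing on a small sample first.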


In [None]:
df.head(10)

Unnamed: 0,date,time,name,tweet,likes_count,retweets_count,mentions,hashtags,replies_count,tweet_processed,prediction
0,2022-11-29,23:39:57,WaWan,@fahmi_fadzil Kemon YB jgn jd sengal.. dulu ka...,0,0,['fahmi_fadzil'],['kerajaanpintubelakang'],0,Kemon YB jgn jd sengal.. dulu kala laungan #k...,negative
1,2022-11-29,22:33:07,M🔺l🔺y🔺 Tanahairku 🇲🇾,I hope it’s not si ponorogo @presidenumnomy #z...,1,0,['presidenumnomy'],"['zahidhamidiletakjawatan', 'umno', 'barisanna...",0,I hope it’s not si ponorogo #zahidhamidiletak...,negative
2,2022-11-29,20:50:34,RameshRaoAKS******,Kencing Dan puak Ularmark @DPPMalaysia @titm_o...,20,8,"['dppmalaysia', 'titm_official', 'abdulhadiawa...","['pakatanharapan', 'barisannasional', 'anwarib...",4,Kencing Dan puak Ularmark berpisah Tiada!!\...,negative
3,2022-11-29,19:28:40,RameshRaoAKS******,"Jika Kamu Sering Dimainkan,\nJika Kamu Sering ...",8,3,"['najibrazak', 'drzahidhamidi']","['anwaribrahim', 'anwaribrahimpm10', 'bossku',...",2,"Jika Kamu Sering Dimainkan,\nJika Kamu Sering ...",positive
4,2022-11-29,18:59:19,New Straits Times,#NSTTV Working with Pakatan Harapan (#PH) and ...,2,2,[],"['nsttv', 'ph', 'barisannasional', 'umno']",1,#NSTTV Working with Pakatan Harapan (#PH) and ...,negative
5,2022-11-29,17:14:53,Cik Asmaliah,@BilliBear3 @malaysiakini barisan nasional is ...,0,0,"['billibear3', 'malaysiakini']",[],0,barisan nasional is a scammer party. !,negative
6,2022-11-29,16:54:40,Amirὒl 🇲🇾,Rindu kerajaan barisan nasional,0,0,[],[],0,Rindu kerajaan barisan nasional,negative
7,2022-11-29,15:12:24,Sukma,"Sama je. PH-DAP, PN-PAS, UMNO-Barisan Nasional.",0,0,[],[],0,"Sama je. PH-DAP, PN-PAS, UMNO-Barisan Nasional.",negative
8,2022-11-29,15:09:14,🇲🇾Astro AWANI🇲🇾,Amanah has decided to give way to the Barisan ...,30,14,[],['awanitonight'],1,Amanah has decided to give way to the Barisan ...,negative
9,2022-11-29,15:03:58,BFM News,1. The name of Pakatan Harapan candidate Fadzl...,38,8,[],[],2,1. The name of Pakatan Harapan candidate Fadzl...,negative


Reorder the columns and drop the intermediate `tweet_processed` column

In [None]:
df = df.reindex(columns=['date','time','name','tweet','likes_count','retweets_count','mentions','hashtags','replies_count','prediction'])

Export the labeled data

In [None]:
df.to_csv('Netizen_Labeled.csv', index=False, encoding='utf-8-sig')

In [None]:
df.head()

Unnamed: 0,date,time,name,tweet,likes_count,retweets_count,mentions,hashtags,replies_count,prediction
0,2022-11-29,23:39:57,WaWan,@fahmi_fadzil Kemon YB jgn jd sengal.. dulu ka...,0,0,['fahmi_fadzil'],['kerajaanpintubelakang'],0,negative
1,2022-11-29,22:33:07,M🔺l🔺y🔺 Tanahairku 🇲🇾,I hope it’s not si ponorogo @presidenumnomy #z...,1,0,['presidenumnomy'],"['zahidhamidiletakjawatan', 'umno', 'barisanna...",0,negative
2,2022-11-29,20:50:34,RameshRaoAKS******,Kencing Dan puak Ularmark @DPPMalaysia @titm_o...,20,8,"['dppmalaysia', 'titm_official', 'abdulhadiawa...","['pakatanharapan', 'barisannasional', 'anwarib...",4,negative
3,2022-11-29,19:28:40,RameshRaoAKS******,"Jika Kamu Sering Dimainkan,\nJika Kamu Sering ...",8,3,"['najibrazak', 'drzahidhamidi']","['anwaribrahim', 'anwaribrahimpm10', 'bossku',...",2,positive
4,2022-11-29,18:59:19,New Straits Times,#NSTTV Working with Pakatan Harapan (#PH) and ...,2,2,[],"['nsttv', 'ph', 'barisannasional', 'umno']",1,negative
