Preprocessing Data for Sentiment Analysis (Politician) <br>
Prepared by Fad

First, we check whether a GPU is available. We use the GPU build of TensorFlow so that deep learning models run as fast as possible; this requires a CUDA-enabled GPU, such as the RTX 3060 Ti in our case.

In [None]:
import tensorflow as tf
from tensorflow.python.client import device_lib

tf.debugging.set_log_device_placement(True)
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 16093434038040944662
 xla_global_id: -1,
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 5748293632
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 11222872952820174810
 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3060 Ti, pci bus id: 0000:09:00.0, compute capability: 8.6"
 xla_global_id: 416903419]
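
`device_lib` sits under `tensorflow.python`, which is not part of TensorFlow's public API. As a minimal sketch, the same check can be done with the public `tf.config.list_physical_devices` call. The `list_gpus` helper name is ours, and the fallback to an empty list when TensorFlow is missing is an assumption made so the snippet runs anywhere.

```python
def list_gpus():
    """Return the names of GPUs visible to TensorFlow, or [] if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Public API; returns one PhysicalDevice per visible GPU
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


gpus = list_gpus()
print(f"{len(gpus)} GPU(s) visible:", gpus)
```

An empty list means TensorFlow will fall back to the CPU.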

In [None]:
import pandas as pd

df = pd.read_csv("Politician_Raw.csv")

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,...,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,Name,Political Party
0,0,1.596499e+18,1.596499e+18,2022-11-26 13:38:11 UTC,2022-11-26,13:38:11,0.0,29370069.0,lim_weijiet,Lim Wei Jiet,...,,,,"{'user_id': None, 'username': None}",,,,,,
1,1,1.595682e+18,1.595681e+18,2022-11-24 07:32:34 UTC,2022-11-24,07:32:34,0.0,29370069.0,lim_weijiet,Lim Wei Jiet,...,,,,"{'user_id': '29370069', 'username': 'lim_weiji...",,,,,,
2,2,1.595681e+18,1.595681e+18,2022-11-24 07:29:21 UTC,2022-11-24,07:29:21,0.0,29370069.0,lim_weijiet,Lim Wei Jiet,...,,,,"{'user_id': None, 'username': None}",,,,,,
3,3,1.595674e+18,1.595674e+18,2022-11-24 07:00:48 UTC,2022-11-24,07:00:48,0.0,29370069.0,lim_weijiet,Lim Wei Jiet,...,,,,"{'user_id': None, 'username': None}",,,,,,
4,4,1.595653e+18,1.595653e+18,2022-11-24 05:38:01 UTC,2022-11-24,05:38:01,0.0,29370069.0,lim_weijiet,Lim Wei Jiet,...,,,,"{'user_id': None, 'username': None}",,,,,,


In [None]:
df.columns

Index(['Unnamed: 0', 'id', 'conversation_id', 'created_at', 'date', 'time',
       'timezone', 'user_id', 'username', 'name', 'place', 'tweet', 'language',
       'mentions', 'urls', 'photos', 'replies_count', 'retweets_count',
       'likes_count', 'hashtags', 'cashtags', 'link', 'retweet', 'quote_url',
       'video', 'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest', 'Name', 'Political Party'],
      dtype='object')

In [None]:
# Get columns that are interesting
df = df[['date', 'time', 'name', 'tweet', 'likes_count', 'retweets_count', 'mentions', 'hashtags', 'replies_count']]

In [None]:
df.head()

Unnamed: 0,date,time,name,tweet,likes_count,retweets_count,mentions,hashtags,replies_count
0,2022-11-26,13:38:11,Lim Wei Jiet,💯💯💯,23.0,1.0,[],[],0.0
1,2022-11-24,07:32:34,Lim Wei Jiet,Saya sentiasa mendoakan Dato' Seri dapat menge...,28.0,4.0,[],[],0.0
2,2022-11-24,07:29:21,Lim Wei Jiet,"Tahniah Dato' Seri Anwar Ibrahim, Perdana Ment...",271.0,43.0,[],[],7.0
3,2022-11-24,07:00:48,Lim Wei Jiet,"Malaysians, your vote absolutely mattered in G...",2268.0,758.0,[],[],19.0
4,2022-11-24,05:38:01,Lim Wei Jiet,Yes!,153.0,14.0,[],[],2.0


#### Check for NaN/Null Values

In [None]:
df.isna().sum()

date              33
time              33
name              33
tweet             33
likes_count       33
retweets_count    33
mentions          33
hashtags          33
replies_count     33
dtype: int64

In [None]:
# Drop rows that contain NaN in any column (every column reports the same 33 missing values)
df = df.dropna()

df.isna().sum()

date              0
time              0
name              0
tweet             0
likes_count       0
retweets_count    0
mentions          0
hashtags          0
replies_count     0
dtype: int64
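
Note that `dropna()` without arguments removes a row if any column is missing. A small sketch on toy data (not our dataset) shows how this differs from restricting the check to the 'tweet' column with `subset`; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: row 1 has no tweet, row 2 has a tweet but no like count
toy = pd.DataFrame({
    "tweet": ["Yes!", None, "Tahniah"],
    "likes_count": [153.0, 2.0, None],
})

any_nan = toy.dropna()                      # drops rows 1 and 2
tweet_only = toy.dropna(subset=["tweet"])   # drops row 1 only

print(len(any_nan), len(tweet_only))
```

In our data both variants should give the same result if the 33 missing values belong to the same rows in every column, which the identical counts above suggest.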

In [None]:
df.shape

(5684, 9)

### NLP Preprocessing

We know that social media text from Twitter is very noisy, and we want to clean it as much as possible so that our models can better understand sentence structure. Using Malaya, we can standardize our text preprocessing.

Things Malaya can do:
- Replace special words with tokens to mitigate the curse of dimensionality, e.g. rm10k becomes <money>.

- Wrap special words in tags, e.g. #drmahathir becomes <hashtag> drmahathir </hashtag>.

- Expand English contractions.

- Expand hashtags, e.g. #drmahathir becomes dr mahathir (requires a segmentation callable).

- Add emoji tags if a demoji object is provided.
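
To make the first two items concrete, here is a minimal regex sketch of the idea. It is not Malaya's implementation; the patterns and the `normalise_tokens` name are assumptions for illustration only.

```python
import re

# Hypothetical patterns: ringgit amounts such as "rm10k" or "RM 10.50", and hashtags
MONEY_RE = re.compile(r"\brm\s?\d+(?:[.,]\d+)?[km]?\b", re.IGNORECASE)
HASHTAG_RE = re.compile(r"#(\w+)")


def normalise_tokens(text):
    """Replace money amounts with <money> and wrap hashtags in <hashtag> tags."""
    text = MONEY_RE.sub("<money>", text)
    return HASHTAG_RE.sub(r"<hashtag> \1 </hashtag>", text)


print(normalise_tokens("harga rm10k #drmahathir"))
# harga <money> <hashtag> drmahathir </hashtag>
```

Collapsing many surface forms into one token keeps the vocabulary small, which is the point of the first item.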
    
Things it can't do:

- Translate non-Malay text (needed for our data)
- Remove punctuation
- Fix grammar (tatabahasa)
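
If punctuation removal is needed, it can be added as a separate step. The following is a minimal sketch using only the standard library; `strip_punctuation` is a hypothetical helper, and it only handles ASCII punctuation (`string.punctuation`), so emoji and non-ASCII symbols are left untouched.

```python
import re
import string

# Map every ASCII punctuation character to a space
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text):
    """Replace ASCII punctuation with spaces, then collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text.translate(_PUNCT_TABLE)).strip()


print(strip_punctuation("Tahniah, Dato' Seri!"))
# Tahniah Dato Seri
```

Because `<` and `>` are punctuation too, this step would also break tags such as `<money>`, so it should run before any tagging or be skipped when tags are in use.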

#### Translation Issue

Malaya does provide a translation API for foreign languages, but it doesn't work in our setup due to a platform issue, and it supports only a limited range of languages. Since our data has only about 5.7k rows, we can simply use Google Sheets to perform the translation.

We upload our data to Google Sheets, create a new column called 'tweet_translated', and apply the translation function. Works like a charm!

![google sheet](https://raw.githubusercontent.com/Muhd-Farhad/PRU15-Twitter-Sentiment-Analysis/main/img/google-sheet.png)

In [None]:
# Export our data as CSV to import into Google Sheets
# (to_csv writes CSV text, so the file extension should be .csv, not .xlsx)
df.to_csv("politician_raw.csv")

Read the new dataframe that contains our translated tweets.

In [None]:
# Read our translated tweet
new_df = pd.read_excel('politician_translated.xlsx')

In [None]:
# Check the tail to make sure the total number of rows matches our original df
new_df.tail()

Unnamed: 0.1,Unnamed: 0,date,time,name,tweet,tweet_translated,likes_count,retweets_count,mentions,hashtags,replies_count
5679,5712,2022-11-01,23:53:52,Hannah Yeoh,Macam mana poster boy BN Ismail Sabri boleh di...,Macam mana poster boy BN Ismail Sabri boleh di...,953,326,[],['posterboytakadakuasa'],204
5680,5713,2022-11-01,16:26:01,Hannah Yeoh,Zahid,Zahid,102,10,[],[],8
5681,5714,2022-11-01,11:55:01,Hannah Yeoh,Mat Sabu Jr 👏👏,Mat Sabu Jr 👏👏,248,28,[],[],11
5682,5715,2022-11-01,00:25:26,Hannah Yeoh,"Pengundi Pahang di KL dan Selangor, kami memer...","Pengundi Pahang di KL dan Selangor, kami memer...",72,35,[],[],5
5683,5716,2022-11-01,00:18:32,Hannah Yeoh,"Semasa PKP dahulu, Menteri PAS ini tak ikut SO...","Semasa PKP dahulu, Menteri PAS ini tak ikut SO...",232,101,[],[],31
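
Reading the last index off `tail()` works, but an explicit check fails loudly if rows were lost during the round trip through the spreadsheet. A minimal sketch, assuming both tables are still in memory; `assert_same_row_count` is a hypothetical helper that relies only on `len()`, so it accepts DataFrames as well as plain lists.

```python
def assert_same_row_count(original, translated):
    """Raise ValueError if the two tables differ in length; otherwise return the count."""
    if len(original) != len(translated):
        raise ValueError(
            f"row count changed: {len(original)} before, {len(translated)} after translation"
        )
    return len(original)


# Toy stand-ins; in the notebook these would be df and new_df
before = ["Yes!", "Zahid", "Tahniah"]
after = ["Yes!", "Zahid", "Congratulations"]
print(assert_same_row_count(before, after))
# 3
```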


In [None]:
new_df = new_df.drop('Unnamed: 0', axis=1)

Now we perform normal preprocessing on our translated text.

In [None]:
import malaya

# Load segmenter
segmenter = malaya.segmentation.transformer(model = 'small', quantized = False)

# Create segmentation function
segmenter_func = lambda x: segmenter.greedy_decoder([x])[0]

# Load preprocessing instance and use segmenter
preprocessing = malaya.preprocessing.preprocessing(segmenter = segmenter_func, annotate = [])

#### What is segmenter?

A common problem with social media text is missing spaces between words. Text segmentation restores them, for example:

- huseinsukamakan ayam,dia sgtrisaukan -> husein suka makan ayam, dia sgt risaukan.

- drmahathir sangat menekankan budaya budakzamansekarang -> dr mahathir sangat menekankan budaya budak zaman sekarang.

- ceritatunnajibrazak -> cerita tun najib razak.

- TunM sukakan -> Tun M sukakan.
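
To give an intuition for what segmentation does, here is a toy dictionary-based sketch. It is not how Malaya's transformer segmenter works (that is a learned model); the `segment` function and the tiny `VOCAB` set are assumptions for illustration.

```python
def segment(text, vocab):
    """Split text into vocabulary words via dynamic programming; return None if impossible."""
    n = len(text)
    # best[i] holds the shortest word list covering text[:i], or None if unreachable
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                # Prefer the split that uses the fewest words
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


VOCAB = {"cerita", "tun", "najib", "razak"}
print(segment("ceritatunnajibrazak", VOCAB))
# ['cerita', 'tun', 'najib', 'razak']
```

A fixed dictionary fails on names, slang and abbreviations such as "sgt", which is why a trained model is the better fit for tweets.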

In [None]:
# Create function to apply on dataframe
def preprocess_string(string):
    return ' '.join(preprocessing.process(string))

In [None]:
%%time
new_df['tweet_processed'] = new_df['tweet_translated'].apply(preprocess_string)

CPU times: total: 3min 22s
Wall time: 2min 25s


### Load our Malaya sentiment model

In [None]:

# View the available transformer models
malaya.sentiment.available_transformer()

Unnamed: 0,Size (MB),Quantized Size (MB),macro precision,macro recall,macro f1-score
bert,425.6,111.0,0.93182,0.93442,0.93307
tiny-bert,57.4,15.4,0.9339,0.93141,0.93262
albert,48.6,12.8,0.91228,0.91929,0.9154
tiny-albert,22.4,5.98,0.91442,0.91646,0.91521
xlnet,446.6,118.0,0.9239,0.92629,0.92444
alxlnet,46.8,13.3,0.91896,0.92589,0.92198


We use BERT because it has the highest macro F1-score (0.93307) in the table above. If you prefer a much smaller model with a slightly lower score, Tiny-BERT is a good option (0.93262 macro F1 at 57.4 MB versus 425.6 MB).
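
The trade-off can be made explicit with a few lines of Python. This is only a sketch: the sizes and macro F1 values are copied from the table above, while the helper names and the 0.001 tolerance are assumptions chosen for illustration.

```python
# (size in MB, macro F1) copied from the table above
MODELS = {
    "bert": (425.6, 0.93307),
    "tiny-bert": (57.4, 0.93262),
    "albert": (48.6, 0.91540),
    "tiny-albert": (22.4, 0.91521),
    "xlnet": (446.6, 0.92444),
    "alxlnet": (46.8, 0.92198),
}


def best_by_f1(models):
    """Name of the model with the highest macro F1."""
    return max(models, key=lambda name: models[name][1])


def smallest_within(models, tolerance):
    """Smallest model whose macro F1 is within `tolerance` of the best one."""
    top_f1 = models[best_by_f1(models)][1]
    close = [name for name, (_, f1) in models.items() if top_f1 - f1 <= tolerance]
    return min(close, key=lambda name: models[name][0])


print(best_by_f1(MODELS))              # bert
print(smallest_within(MODELS, 0.001))  # tiny-bert
```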

In [None]:
model = malaya.sentiment.transformer(model = 'bert', quantized = False)

In [None]:
# Create function for prediction
def predict_tweet(string):
    return model.predict([string])[0]

In [None]:
%%time

# Apply the function to the 'tweet_processed' column and store the results in a new column 'prediction'
new_df['prediction'] = new_df['tweet_processed'].apply(predict_tweet)

CPU times: total: 1min 18s
Wall time: 1min 14s
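
Since `model.predict` takes a list of strings, calling it once per row leaves batching unused. A possible alternative is to send the tweets in chunks. This is a sketch under assumptions: `predict_in_batches` is a hypothetical helper, the batch size of 32 is arbitrary, and a stand-in function replaces the real model so the snippet runs on its own.

```python
def predict_in_batches(texts, predict_fn, batch_size=32):
    """Call predict_fn on successive slices of texts and concatenate the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        results.extend(predict_fn(batch))
    return results


def fake_predict(batch):
    # Stand-in for model.predict: labels a text 'positive' if it contains 'tahniah'
    return ["positive" if "tahniah" in text.lower() else "neutral" for text in batch]


print(predict_in_batches(["Tahniah!", "Zahid", "Yes!"], fake_predict, batch_size=2))
# ['positive', 'neutral', 'neutral']
```

In the notebook this would look like `predict_in_batches(new_df['tweet_processed'].tolist(), model.predict)`; whether it is actually faster depends on the model and the available GPU memory, so it is worth timing both.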


In [None]:
new_df.head(10)

Unnamed: 0,date,time,name,tweet,likes_count,retweets_count,mentions,hashtags,replies_count,prediction
0,2022-11-26,13:38:11,Lim Wei Jiet,💯💯💯,23,1,[],[],0,neutral
1,2022-11-24,07:32:34,Lim Wei Jiet,Saya sentiasa mendoakan Dato' Seri dapat menge...,28,4,[],[],0,positive
2,2022-11-24,07:29:21,Lim Wei Jiet,"Tahniah Dato' Seri Anwar Ibrahim, Perdana Ment...",271,43,[],[],7,positive
3,2022-11-24,07:00:48,Lim Wei Jiet,"Malaysians, your vote absolutely mattered in G...",2268,758,[],[],19,positive
4,2022-11-24,05:38:01,Lim Wei Jiet,Yes!,153,14,[],[],2,neutral
5,2022-11-19,15:13:11,Lim Wei Jiet,"Unfortunately, we did not manage to win Tanjun...",5668,803,[],[],208,positive
6,2022-11-19,09:19:30,Lim Wei Jiet,One more hour!\n\nYour vote will make a differ...,277,63,[],[],2,positive
7,2022-11-19,07:59:24,Lim Wei Jiet,Semangat gila pak guard di SJK(C) Bin Chong. h...,83,7,[],[],1,neutral
8,2022-11-19,07:51:38,Lim Wei Jiet,https://t.co/ApdMlQHPD6,17,0,[],[],0,neutral
9,2022-11-19,06:24:36,Lim Wei Jiet,Melawat pusat-pusat mengundi di sekitar Pekan ...,56,4,[],[],2,positive


Reorder our columns and drop the ones we no longer need

In [None]:
new_df = new_df.reindex(columns=['date','time','name','tweet','likes_count','retweets_count','mentions','hashtags','replies_count','prediction'])

Export our clean data

In [None]:
# Use utf-8-sig (UTF-8 with a byte order mark) so that Excel detects the encoding and displays emoji correctly
new_df.to_csv('Politicians_Labeled.csv', index=False, encoding='utf-8-sig')
# Also export as an Excel workbook
new_df.to_excel('Politicians_Labeled.xlsx')
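
For readers wondering what `utf-8-sig` changes: it writes the same UTF-8 bytes, preceded by a three-byte byte order mark (BOM) that lets spreadsheet programs recognise the encoding. A short standard-library sketch, with a made-up sample string:

```python
import codecs

text = "💯 Tahniah"
plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# The only difference is the BOM prefix
print(with_bom.startswith(codecs.BOM_UTF8))       # True
print(with_bom[len(codecs.BOM_UTF8):] == plain)   # True
# Decoding with utf-8-sig strips the BOM again
print(with_bom.decode("utf-8-sig") == text)       # True
```

When reading the CSV back with pandas, passing `encoding='utf-8-sig'` avoids the BOM ending up in the first column name.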