## CAPSTONE PROJECT: TWITTER SENTIMENT ANALYSIS ON INDONESIAN CAPITAL RELOCATION PLAN

### This project is organized in 4 notebooks:
<ul>
<li>Notebook 1: Scraping tweets from Twitter</li>
<li>Notebook 2: Data cleaning and EDA</li>
<li>Notebook 3: Preprocessing and Modeling 1: IndoBert sentiment analysis</li>
<li>Notebook 4 (on Google Colab): Modeling 2, which consists of the following tasks:
        <ul>
        <li>attempting to fine-tune the IndoBenchmark IndoBert model</li>
        <li>evaluating the Bert multilingual model's performance</li>
        <li>topic classification with IndoBert GPT2-small</li>
        </ul>
</li>
</ul>

Notebook 4 is accessible on [Google Colab](https://colab.research.google.com/drive/1-YByOO9JaoM5d9Feyd_vfaIQF4kJbu9M#scrollTo=LRNJPxMre_J1)

### This is Notebook 3

### Import

In [None]:
import pandas as pd
import numpy as np

import torch

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, get_scorer, f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

from transformers import BertForSequenceClassification, BertConfig, BertTokenizer, TrainingArguments, Trainer

### Load and inspect data

In [2]:
labeled_tweets_df = pd.read_csv('labeled_tweets.csv')

In [3]:
modeling_data = labeled_tweets_df[['processed_tweets', 'label']]
modeling_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8622 entries, 0 to 8621
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   processed_tweets  8622 non-null   object 
 1   label             8622 non-null   float64
dtypes: float64(1), object(1)
memory usage: 134.8+ KB


In [13]:
# establish X and y

X = modeling_data['processed_tweets']
y = modeling_data['label']

In [14]:
# train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [15]:
# load tokenizer and model

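from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
# AutoModel loads the bare encoder without a classification head,
# so num_labels is stored in the config but not used by this model
model_sl = AutoModel.from_pretrained("sarahlintang/IndoBERT", num_labels=3)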

Some weights of the model checkpoint at sarahlintang/IndoBERT were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [17]:
model_sl

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(35000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [34]:
# apply tokenizer to tweet column

tokenized = modeling_data['processed_tweets'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [35]:
# max length

max_len = max(len(i) for i in tokenized.values)
max_len

66

In [36]:
# add padding

padded = [i + [0]*(max_len-len(i)) for i in tokenized.values]
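The padding step above appends zeros but does not record which positions are padding. When a BERT model is called without an `attention_mask`, it treats every position as a real token, so the pad positions are attended to as well. A minimal sketch of building the mask alongside the padding, using plain lists and made-up token ids (`pad_with_mask` and `toy_ids` are hypothetical names, not part of the notebook):

```python
def pad_with_mask(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build a 0/1 attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)   # real ids, then padding
        mask.append([1] * len(seq) + [0] * n_pad)     # 1 = real token, 0 = padding
    return padded, mask


# hypothetical token ids for two short tweets
toy_ids = [[2, 15, 8, 3], [2, 7, 3]]
padded_ids, attention_mask = pad_with_mask(toy_ids)
```

The mask could then be converted to a tensor and passed to the model as `attention_mask=...` next to `input_ids`. This would change the extracted features, so the scores reported further down would not necessarily be reproduced.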

In [37]:
# establish inputs ids and get last hidden states to fit into classification models

input_ids = torch.tensor(np.array(padded))  

with torch.no_grad():
    last_hidden_states = model_sl(input_ids)

In [38]:
# establish features (X) for modeling

features = last_hidden_states[0][:,0,:].numpy()
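The slice `last_hidden_states[0][:, 0, :]` keeps, for every tweet, only the hidden vector at position 0, which is the `[CLS]` token that the tokenizer adds when `add_special_tokens=True`. A toy sketch of the same indexing with nested lists standing in for the tensor (values and sizes are made up; the real tensor here would have 8622 rows and a hidden size of 768):

```python
# hypothetical hidden states: batch of 2 tweets, 3 tokens each, hidden size 2
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

# equivalent of tensor[:, 0, :]: take the first token's vector for every tweet
cls_features = [tokens[0] for tokens in hidden]
```

Each tweet is thereby reduced to one fixed-length vector, which is what the scikit-learn classifiers below expect as input.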

In [39]:
# establish labels (y)

labels = modeling_data['label']

In [40]:
# train test split

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, stratify=labels, random_state=42)

### Model 1. Naive Bayes

In [41]:
# instantiate naive bayes model

params_nb = {'alpha': [0.0001,0.1,10]}
model_nb = MultinomialNB()
grid_nb = GridSearchCV(model_nb, param_grid=params_nb, cv=5, verbose=1)

In [45]:
# fit naive bayes model

pipe_nb = Pipeline([('Normalizing',MinMaxScaler()),('MultinomialNB',MultinomialNB())])
pipe_nb.fit(train_features, train_labels) 

Pipeline(steps=[('Normalizing', MinMaxScaler()),
                ('MultinomialNB', MultinomialNB())])
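The pipeline places `MinMaxScaler` ahead of `MultinomialNB`, presumably because `MultinomialNB` rejects negative feature values and BERT embeddings contain them. A minimal pure-Python sketch of what per-column min-max scaling does, on made-up numbers (`min_max_scale` is a hypothetical helper, not the scikit-learn implementation):

```python
def min_max_scale(rows):
    """Rescale each column of a list of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # constant columns are mapped to 0.0 to avoid dividing by zero
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# toy feature matrix with a negative value in the first column
toy = [[-1.0, 0.5], [0.0, 0.5], [3.0, 2.5]]
scaled = min_max_scale(toy)
```

After scaling, every value lies between 0 and 1, so the non-negativity requirement is met.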

In [47]:
# naive bayes cross val score:

scores_nb = cross_val_score(pipe_nb, train_features, train_labels)
print("Multinomial NaiveBayes classifier score: %0.3f (+/- %0.2f)" % (scores_nb.mean(), scores_nb.std() * 2))

Multinomial NaiveBayes classifier score: 0.670 (+/- 0.03)


### Model 2. SVC

In [49]:
# instantiate svc
from sklearn.model_selection import StratifiedKFold
params_svc = {"C": np.linspace(0.0001, 0.1, 10)}
svc = SVC(max_iter=500, class_weight='balanced')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# pass the shuffled stratified splitter defined above instead of a plain cv=5
grid_svc = GridSearchCV(svc, param_grid=params_svc, cv=cv, verbose=1)

In [74]:
# fit svc model

pipe_svc = Pipeline([('Normalizing',MinMaxScaler()),('grid_svc', SVC())])
pipe_svc.fit(train_features, train_labels) 

Pipeline(steps=[('Normalizing', MinMaxScaler()), ('grid_svc', SVC())])

In [75]:
# svc cross val score:

scores_svc = cross_val_score(pipe_svc, train_features, train_labels)
print("SVM classifier average score: %0.3f (+/- %0.2f)" % (scores_svc.mean(), scores_svc.std() * 2))

SVM classifier average score: 0.724 (+/- 0.01)


**Observation:**<br>
In cross-validation the SVM classifier scores highest, at 0.724, against 0.670 for Naive Bayes. Note that `cross_val_score` was called without a `scoring` argument, so these figures are accuracy rather than F1.
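For classifiers, `cross_val_score` reports accuracy unless a `scoring` argument is passed, and accuracy and F1 can diverge noticeably when classes are imbalanced. A minimal pure-Python sketch on made-up labels (the helper functions and the toy data are illustrative assumptions, not results from this project):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# toy case: a model that always predicts the majority class
y_true = [1, 1, 1, 1, 0, 2]
y_pred = [1, 1, 1, 1, 1, 1]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is about 0.67 while macro F1 is about 0.27. In scikit-learn, passing `scoring='f1_macro'` to `cross_val_score` would report the macro-averaged F1 instead of accuracy.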


### Model 3. Bert sentiment analysis

In [79]:
from transformers import pipeline



In [81]:
pipe_bert = pipeline('sentiment-analysis')


In [86]:
# make prediction using bert sentiment analysis

pred_bert = []
for i in modeling_data['processed_tweets']:
    pred = pipe_bert(i)
    pred_bert.append(pred)

In [104]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
modeling_data = modeling_data.assign(predicted_label=pred_bert)
modeling_data.head(20)


| | processed_tweets | label | predicted_label |
|---|---|---|---|
| 0 | sensasi gimana ya | 1.0 | [{'label': 'POSITIVE', 'score': 0.752638816833... |
| 1 | metaverse bernama jagat merasakan sensasi | 1.0 | [{'label': 'NEGATIVE', 'score': 0.972690463066... |
| 2 | metaversa sensasi bernama jagat | 1.0 | [{'label': 'NEGATIVE', 'score': 0.987670481204... |
| 3 | tahap metaverse bernama jagat merasakan sensasi | 1.0 | [{'label': 'NEGATIVE', 'score': 0.959913432598... |
| 4 | tahap awalnamun merasakan sensasi metaverse be... | 1.0 | [{'label': 'NEGATIVE', 'score': 0.990127384662... |
| 5 | metaverse jagat diluncurkan sensasi | 1.0 | [{'label': 'NEGATIVE', 'score': 0.987223684787... |
| 6 | metaverse bernama jagat masyarakat merasakan s... | 1.0 | [{'label': 'NEGATIVE', 'score': 0.983072876930... |
| 7 | metaverse jagat merasakan sensasi | 1.0 | [{'label': 'NEGATIVE', 'score': 0.984077095985... |
| 8 | nama metaverse jagat | 0.0 | [{'label': 'NEGATIVE', 'score': 0.998415887355... |
| 9 | tahap bary merasakan sensasi | 1.0 | [{'label': 'NEGATIVE', 'score': 0.669706046581... |


**Observation:**<br>
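Most of the predictions appear to be wrong. When no model is specified, the `sentiment-analysis` pipeline falls back to a default English-language checkpoint that only outputs POSITIVE or NEGATIVE, so it is not well suited to these Indonesian tweets.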
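Each pipeline prediction is a one-element list holding a dict with `label` and `score`, which makes the column hard to compare with the numeric `label` column. A minimal sketch that flattens the predictions and counts how often each (label, predicted label) pair occurs, using rows 0, 1, 2 and 8 of the table above with scores rounded to three decimals. The meaning of the numeric labels is not stated in this notebook, so the sketch counts pairs rather than assuming a mapping:

```python
from collections import Counter

# rows 0, 1, 2 and 8 from the table above (scores rounded to 3 decimals)
labels = [1.0, 1.0, 1.0, 0.0]
pred_bert = [
    [{"label": "POSITIVE", "score": 0.753}],
    [{"label": "NEGATIVE", "score": 0.973}],
    [{"label": "NEGATIVE", "score": 0.988}],
    [{"label": "NEGATIVE", "score": 0.998}],
]

# pull the label string out of each one-element list of dicts
predicted = [pred[0]["label"] for pred in pred_bert]

# count co-occurrences of the annotated label and the predicted label
pairs = Counter(zip(labels, predicted))
```

Run over the full dataset, such a tally would turn the visual impression from the table into counts.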

### Model 5. Fine-tuning IndoBert

### Model 6. Bert multilingual pretrained model

### Model 7. IndoBert GPT2-small-indonesian-522M

Models 5, 6 and 7 are accessible on [Google Colab](https://colab.research.google.com/drive/1-YByOO9JaoM5d9Feyd_vfaIQF4kJbu9M#scrollTo=sZVDIx-VEkdd&uniqifier=1)

### Conclusion

IndoBert models perform reasonably well on unseen data for the following two tasks:<br>
(i) sentiment analysis with [IndoBert](https://huggingface.co/sarahlintang/IndoBERT)<br>
(ii) topic classification with [IndoBert GPT2-small](https://huggingface.co/cahya/gpt2-small-indonesian-522M?text=Pulau+Dewata+sering+dikunjungi)<br>

These models perform much better than the Bert multilingual model, which is trained on 102 languages including Indonesian: the F1 scores are 0.725 (IndoBert) vs 0.343 (multilingual Bert).<br>
The IndoBert model still depends on Indonesian-language Python pre-processing packages, such as nltk (Indonesian) and the Sastrawi stemmer.
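The pre-processing itself is not shown in the cells above. As a rough illustration only of the kind of cleaning that typically precedes tokenization, here is a sketch using just the standard library. The stopword set is a tiny hypothetical sample, and the example tweet, mention and URL are made up; in practice a full Indonesian stopword list and a stemmer such as Sastrawi would replace these stand-ins.

```python
import re

# tiny hypothetical stopword sample, not a real Indonesian stopword list
STOPWORDS = {"yang", "di", "dan", "ke", "ini", "itu"}


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions, hashtags and non-letters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


# made-up example tweet
example = clean_tweet("Ibu kota baru di Kalimantan! https://t.co/abc @jokowi #IKN")
```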

From the network visualization in Gephi, the most active and most connected Twitter users can be identified. The visualization also reveals which users are interested in the capital relocation project and who the main actors are within the project's social network.


### Limitations

- The analysis is based on a limited amount of data collected over a limited period of time.<br>
- The project does not consider the sentiments of people who are not on Twitter.


### Recommendations

Building on this project, further work could:

- investigate the Twitter social network and its users' political interests
- identify social media buzzers and the paid sentiment they tweet
- extend the data to cover tweets from 2019, when the capital relocation plan was first announced
- explore other Indonesian-language pretrained models and other foreign-language models
- fine-tune the models


### Links:

To [Tableau interactive dashboard presentation slides](https://public.tableau.com/app/profile/m.alexander8473/viz/capitalrelocationtwitteranalysis/presentation?publish=yes)<br>
To [Google Colab](https://colab.research.google.com/drive/1-YByOO9JaoM5d9Feyd_vfaIQF4kJbu9M#scrollTo=LRNJPxMre_J1) notebook 4.