## CAPSTONE PROJECT: TWITTER SENTIMENT ANALYSIS ON INDONESIAN CAPITAL RELOCATION PLAN

### Problem Statement:

During the Independence Day celebration on August 17th, 2019, Indonesian president Joko Widodo announced his plan to move the capital city from Jakarta, on the island of Java, to a new location on the island of Kalimantan. The reasons given are to relieve overcrowding in Jakarta and to promote development in other parts of the country.
The decision was imposed by the government in a top-down manner on a population of about 280 million, yet public voices are seldom heard on the international stage.

The objectives of this project are:<br>
(i) to understand public sentiment toward the Indonesian capital relocation plan; <br>
(ii) to investigate the performance of Indonesian-language pretrained models, specifically the Hugging Face IndoBert model retrieved from this [Link](https://huggingface.co/sarahlintang/IndoBERT) and the GPT2-small model retrieved from this [Link](https://huggingface.co/cahya/gpt2-small-indonesian-522M?text=Pulau+Dewata+sering+dikunjungi).<br>

The first model, IndoBert (Indonesian Bert), is a pre-trained language model based on the BERT architecture for the Indonesian language. It was pre-trained on 16 GB of raw text (~2 B words) from the [Oscar Corpus](https://oscar-corpus.com/) on three tasks: (i) extractive summarization; (ii) sentiment analysis; and (iii) part-of-speech tagging. The second model, GPT2-small-indonesian-522M, was pre-trained on Indonesian Wikipedia using a causal language modeling (CLM) objective. This model is uncased: it does not distinguish between indonesia and Indonesia ([Reference](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers)).

### Research Questions:

- 1. How did tweet volume and sentiment vary over time?
- 2. Where did most tweets come from and what did they say?
- 3. How did models pretrained on the Indonesian language perform on: (i) sentiment analysis, and (ii) topic classification?
- 4. Who are the most active Twitter users and how are they connected to each other?


### Data sources:

Tweets scraped from Twitter using the keywords "ibu kota nusantara" and "jagat nusantara", with the maximum number of tweets set to 12,000.

### This project is organized in 4 notebooks:
<ul>
<li>Notebook 1: Scraping tweets from Twitter</li>
<li>Notebook 2: Data cleaning and EDA</li>
<li>Notebook 3: Preprocessing and Modeling 1: IndoBert sentiment analysis</li>
<li>Notebook 4 (on Google Colab): Modeling 2, which consists of the following tasks:
        <ul>
        <li>attempting to fine-tune the IndoBenchmark IndoBert model</li>
        <li>evaluating the multilingual Bert model's performance</li>
        <li>topic classification with IndoBert GPT2-small</li>
        </ul>
</li>
</ul>

Notebook 4 is on Google Colab, accessible through this [Link](https://colab.research.google.com/drive/1-YByOO9JaoM5d9Feyd_vfaIQF4kJbu9M#scrollTo=XCZR-ckZNIls)<br>
The project presentation is an interactive Tableau dashboard, accessible through this [Link](https://public.tableau.com/app/profile/m.alexander8473/viz/capitalrelocationtwitteranalysis/presentation?publish=yes)

### This is Notebook 1

#### Import libraries and modules

In [1]:
import os
import pandas as pd
import itertools
import snscrape.modules.twitter as sntwitter

#### Scraping for tweets

##### (i) Scraping for the 12,000 latest tweets with keywords: "ibu kota nusantara" or "jagat nusantara"

In [12]:
# Set the keywords to "ibu kota nusantara" or "jagat nusantara", the time period to March through October 2022, and the maximum number of tweets to 12,000

scraped_df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    'ibu kota nusantara OR jagat nusantara since:2022-03-01 until:2022-10-31').get_items(), 12000)) 
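The search string above combines keywords with `OR` and bounds the time window with the `since:` and `until:` operators. As a minimal sketch, the same string could be assembled from its parts so that keywords and dates are easier to change. The helper name `build_search_query` is hypothetical and not part of snscrape.

```python
from datetime import date


def build_search_query(keywords, since, until):
    """Join keywords with OR and append since:/until: date operators."""
    terms = " OR ".join(keywords)
    return f"{terms} since:{since.isoformat()} until:{until.isoformat()}"


# Reproduce the query string used in the scraping cell
query = build_search_query(
    ["ibu kota nusantara", "jagat nusantara"],
    since=date(2022, 3, 1),
    until=date(2022, 10, 31),
)
print(query)
# ibu kota nusantara OR jagat nusantara since:2022-03-01 until:2022-10-31
```

The resulting `query` could then be passed to `sntwitter.TwitterSearchScraper` in place of the hard-coded string.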

In [13]:
scraped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   url               12000 non-null  object             
 1   date              12000 non-null  datetime64[ns, UTC]
 2   content           12000 non-null  object             
 3   renderedContent   12000 non-null  object             
 4   id                12000 non-null  int64              
 5   user              12000 non-null  object             
 6   replyCount        12000 non-null  int64              
 7   retweetCount      12000 non-null  int64              
 8   likeCount         12000 non-null  int64              
 9   quoteCount        12000 non-null  int64              
 10  conversationId    12000 non-null  int64              
 11  lang              12000 non-null  object             
 12  source            12000 non-null  object             
 13  s

In [14]:
# export to csv

scraped_df.to_csv('../data/nusantara.csv', index=False)
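The export above fails if the `../data` directory does not exist yet. A minimal sketch of a guard, using the `os` module imported at the top of the notebook, is shown here. The helper name `export_csv` is hypothetical, and it assumes only that the object passed in has a pandas-style `to_csv` method.

```python
import os


def export_csv(df, path):
    """Create the parent directory if needed, then write df to path without the index."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory is already there
        os.makedirs(parent, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Hypothetical usage with the DataFrame from the scraping cell:
# export_csv(scraped_df, '../data/nusantara.csv')
```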

In [16]:
scraped_df.tail()

| | url | date | content | renderedContent | id | user | replyCount | retweetCount | likeCount | quoteCount | ... | media | retweetedTweet | quotedTweet | inReplyToTweetId | inReplyToUser | mentionedUsers | coordinates | place | hashtags | cashtags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11995 | https://twitter.com/soloposdotcom/status/15003... | 2022-03-06 03:04:34+00:00 | Sejarah Nusantara yang Jadi Nama Ibu Kota Baru... | Sejarah Nusantara yang Jadi Nama Ibu Kota Baru... | 1500306317088411654 | {'username': 'soloposdotcom', 'id': 155169715,... | 0 | 0 | 0 | 0 | ... |  |  |  |  |  |  |  |  |  |  |
| 11996 | https://twitter.com/penajam_terkini/status/150... | 2022-03-06 02:59:04+00:00 | Pintu gerbang Ibu Kota Negara Nusantara #IKN #... | Pintu gerbang Ibu Kota Negara Nusantara #IKN #... | 1500304932347715585 | {'username': 'penajam_terkini', 'id': 33206990... | 0 | 0 | 0 | 0 | ... | [{'previewUrl': 'https://pbs.twimg.com/media/F... |  |  |  |  |  | {'longitude': 113.836655, 'latitude': -2.409401} | {'fullName': 'East Borneo, Indonesia', 'name':... | [IKN, kotanusantara] |  |
| 11997 | https://twitter.com/PolitikLingkar/status/1500... | 2022-03-06 02:41:20+00:00 | Pembangunan Ibu Kota Negara (IKN) Nusantara di... | Pembangunan Ibu Kota Negara (IKN) Nusantara di... | 1500300468807155717 | {'username': 'PolitikLingkar', 'id': 106940136... | 0 | 0 | 0 | 0 | ... | [{'previewUrl': 'https://pbs.twimg.com/media/F... |  |  |  |  |  |  |  |  |  |
| 11998 | https://twitter.com/kumparan/status/1500299637... | 2022-03-06 02:38:02+00:00 | Kerja dengan berbagai pihak dilakukan guna mem... | Kerja dengan berbagai pihak dilakukan guna mem... | 1500299637923598338 | {'username': 'kumparan', 'id': 759692754985242... | 0 | 1 | 1 | 0 | ... |  |  |  |  |  |  |  |  | [kumparanTECH] |  |
| 11999 | https://twitter.com/HakikiGabut/status/1500295... | 2022-03-06 02:19:52+00:00 | Kabar gembira bagi ASN yang akan pindah ke Ibu... | Kabar gembira bagi ASN yang akan pindah ke Ibu... | 1500295066669416449 | {'username': 'HakikiGabut', 'id': 131880145261... | 0 | 0 | 0 | 0 | ... | [{'previewUrl': 'https://pbs.twimg.com/media/F... |  |  |  |  |  |  |  |  |  |
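
In the rows above, the `user` column holds a nested record for each tweet, while the account name also appears in the `url` column. For later per-user analysis, a flat username can be recovered from the URL. This is a minimal sketch, assuming every URL follows the `https://twitter.com/<username>/status/<id>` layout seen in the output; the helper name `username_from_url` is hypothetical.

```python
from urllib.parse import urlparse


def username_from_url(tweet_url):
    """Return the account name from a tweet URL, or None if the layout differs."""
    parts = urlparse(tweet_url).path.strip("/").split("/")
    # Assumed path layout: <username>/status/<tweet id>
    if len(parts) >= 3 and parts[1] == "status":
        return parts[0]
    return None


print(username_from_url("https://twitter.com/kumparan/status/1500299637923598338"))
# kumparan
```

Applied to the DataFrame, this could look like `scraped_df['url'].map(username_from_url)`, with the result stored in a new column of your choice.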


#### Continue to Notebook 2 for data cleaning and EDA