## CAPSTONE PROJECT: TWITTER SENTIMENT ANALYSIS ON INDONESIAN CAPITAL RELOCATION PLAN

### This is a data preparation notebook

This notebook has 2 parts:<br>
- Part 1: Data preparation for network visualization on Gephi
- Part 2: Data preparation for Tableau dashboard presentation

### Import libraries and modules

In [1]:
import pandas as pd
import numpy as np

### Import data and inspect

In [5]:
# import raw data

unprocessed = pd.read_csv('../data/nusantara.csv')

In [6]:
unprocessed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   url               12000 non-null  object 
 1   date              12000 non-null  object 
 2   content           12000 non-null  object 
 3   renderedContent   12000 non-null  object 
 4   id                12000 non-null  int64  
 5   user              12000 non-null  object 
 6   replyCount        12000 non-null  int64  
 7   retweetCount      12000 non-null  int64  
 8   likeCount         12000 non-null  int64  
 9   quoteCount        12000 non-null  int64  
 10  conversationId    12000 non-null  int64  
 11  lang              12000 non-null  object 
 12  source            12000 non-null  object 
 13  sourceUrl         12000 non-null  object 
 14  sourceLabel       12000 non-null  object 
 15  outlinks          4546 non-null   object 
 16  tcooutlinks       4546 non-null   object

In [None]:
# filter out data with reply information

# network = network.loc[network['inReplyToUser'].notnull(),:]

In [None]:
# save file to csv for future use (this notebook was worked on over several days, so the data was exported to resume work later)

# network.to_csv('./data/network_gep.csv',index=False)

#### Part 1: Data prep for Gephi
In this part, the goal is to produce two dataframes, nodes and edges. As a minimal requirement, the edges dataframe needs source and target columns.

In [6]:
# import network data

network=pd.read_csv('../data/network_gep.csv')
network.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2360 entries, 0 to 2359
Columns: 113 entries, user_id to Unnamed: 112
dtypes: float64(75), object(38)
memory usage: 2.0+ MB


In [9]:
network.columns

Index(['user_id', 'user_name', 'Unnamed: 2', 'reply_name', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
       ...
       'Unnamed: 103', 'Unnamed: 104', 'Unnamed: 105', 'Unnamed: 106',
       'Unnamed: 107', 'Unnamed: 108', 'Unnamed: 109', 'Unnamed: 110',
       'Unnamed: 111', 'Unnamed: 112'],
      dtype='object', length=113)
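The many `Unnamed: N` columns listed above suggest that the CSV carries empty trailing columns. A minimal sketch of loading only the needed columns, assuming the first four columns are the relevant ones and that IDs are better kept as text; `csv_text` is a made-up stand-in for `network_gep.csv`, not the real file:

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV: blank header cells and empty trailing columns.
csv_text = (
    "user_id,user_name,,reply_name,,\n"
    "1035870000000000001,'a',377233100,'b',,\n"
)

# usecols keeps only the first four columns; dtype=str reads IDs as text
# so that long numeric IDs are not converted to floats.
demo = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 2, 3], dtype=str)
first_id = demo.iloc[0, 0]
```

Reading the IDs as strings is a design choice for this sketch; it avoids the scientific-notation display seen in the outputs of this notebook.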

In [10]:
network=network[['user_id','user_name','Unnamed: 2','reply_name']]

In [14]:
# establish source column

network['source']=network['user_id']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [16]:
# establish target column

network['target']=network['Unnamed: 2']
network.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,user_id,user_name,Unnamed: 2,reply_name,source,target
0,1.03587e+18,"'hyacinthyou_',",1.03587e+18,'hyacinthyou_',1.03587e+18,1.03587e+18
1,1.58299e+18,"'alparezi11',",377233100.0,'AbdinegaraSetia',1.58299e+18,377233100.0
2,1.31029e+18,"'Penjagajokowi1',",1.31029e+18,'Penjagajokowi1',1.31029e+18,1.31029e+18
3,1.31029e+18,"'Penjagajokowi1',",1.31029e+18,'Penjagajokowi1',1.31029e+18,1.31029e+18
4,1.31029e+18,"'Penjagajokowi1',",1.31029e+18,'Penjagajokowi1',1.31029e+18,1.31029e+18
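The warnings in the cells above are pandas' `SettingWithCopyWarning`, raised because `network` was created by selecting columns from another dataframe and then modified. One possible way to avoid them, sketched here on a made-up dataframe (`raw_demo` and `network_demo` are illustrative names, not part of this notebook), is to take an explicit copy and rename the columns in one step:

```python
import pandas as pd

# Hypothetical stand-in for the loaded network data.
raw_demo = pd.DataFrame({
    "user_id": [1, 2],
    "user_name": ["a", "b"],
    "Unnamed: 2": [1, 3],
    "reply_name": ["a", "c"],
    "Unnamed: 4": [None, None],
})

# .copy() makes the selection an independent dataframe, so later
# column assignments no longer trigger the warning.
network_demo = (
    raw_demo[["user_id", "user_name", "Unnamed: 2", "reply_name"]]
    .copy()
    .rename(columns={
        "user_id": "source",
        "user_name": "source_name",
        "Unnamed: 2": "target",
        "reply_name": "target_name",
    })
)

# Adding a column now works without a warning.
network_demo["self_reply"] = network_demo["source"] == network_demo["target"]
```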


In [17]:
# curate the columns as: source, target, source name, target name

network['source_name']=network['user_name']
network['target_name']=network['reply_name']
network.drop(columns=['user_id','user_name','Unnamed: 2','reply_name'],inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [20]:
network.head()

Unnamed: 0,source,target,source_name,target_name
0,1.03587e+18,1.03587e+18,"'hyacinthyou_',",'hyacinthyou_'
1,1.58299e+18,377233100.0,"'alparezi11',",'AbdinegaraSetia'
2,1.31029e+18,1.31029e+18,"'Penjagajokowi1',",'Penjagajokowi1'
3,1.31029e+18,1.31029e+18,"'Penjagajokowi1',",'Penjagajokowi1'
4,1.31029e+18,1.31029e+18,"'Penjagajokowi1',",'Penjagajokowi1'


In [19]:
# remove commas from names:

new_name=[]
for i in network['source_name']:
    new_name.append(str(i).rstrip(','))
    

In [27]:
# remove quotation marks from source names:

newer_name=[]
for i in new_name:
    newer_name.append(str(i).strip("\'"))
    

In [29]:
network['source_name']=newer_name
network.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,source,target,source_name,target_name
0,1.03587e+18,1.03587e+18,hyacinthyou_,'hyacinthyou_'
1,1.58299e+18,377233100.0,alparezi11,'AbdinegaraSetia'
2,1.31029e+18,1.31029e+18,Penjagajokowi1,'Penjagajokowi1'
3,1.31029e+18,1.31029e+18,Penjagajokowi1,'Penjagajokowi1'
4,1.31029e+18,1.31029e+18,Penjagajokowi1,'Penjagajokowi1'


In [30]:
# remove quotation marks from target names:

target_name=[]
for i in network['target_name']:
    target_name.append(str(i).strip("\'"))

In [31]:
network['target_name']=target_name
network.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,source,target,source_name,target_name
0,1.03587e+18,1.03587e+18,hyacinthyou_,hyacinthyou_
1,1.58299e+18,377233100.0,alparezi11,AbdinegaraSetia
2,1.31029e+18,1.31029e+18,Penjagajokowi1,Penjagajokowi1
3,1.31029e+18,1.31029e+18,Penjagajokowi1,Penjagajokowi1
4,1.31029e+18,1.31029e+18,Penjagajokowi1,Penjagajokowi1
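The name clean-up above can also be written with pandas string methods instead of explicit loops. A minimal sketch on made-up rows (`names_demo` is an illustrative name); note that, unlike `str(i)` in the loops, the `.str` accessor leaves missing values as missing rather than turning them into the text `nan`:

```python
import pandas as pd

# Hypothetical rows in the same shape as the raw name columns.
names_demo = pd.DataFrame({
    "source_name": ["'hyacinthyou_',", "'alparezi11',"],
    "target_name": ["'hyacinthyou_'", "'AbdinegaraSetia'"],
})

# Drop a trailing comma first, then the surrounding single quotes.
for col in ["source_name", "target_name"]:
    names_demo[col] = names_demo[col].str.rstrip(",").str.strip("'")
```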


In [33]:
network.drop_duplicates(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


In [34]:
network.head()

Unnamed: 0,source,target,source_name,target_name
0,1.03587e+18,1.03587e+18,hyacinthyou_,hyacinthyou_
1,1.58299e+18,377233100.0,alparezi11,AbdinegaraSetia
2,1.31029e+18,1.31029e+18,Penjagajokowi1,Penjagajokowi1
6,8.84689e+17,8.84689e+17,GerBangNKRI,GerBangNKRI
7,1.28463e+18,1.28463e+18,Ikhaamina,Ikhaamina


The next few steps remove self-replies. We want to know which users reply to which other users, so self-replies are not relevant here.

In [52]:
network['self'] = network['source']-network['target']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [55]:
zero_self=network.loc[network['self']==0.0,:]
zero_self.head()

Unnamed: 0,source,target,source_name,target_name,self
0,1.03587e+18,1.03587e+18,hyacinthyou_,hyacinthyou_,0.0
2,1.31029e+18,1.31029e+18,Penjagajokowi1,Penjagajokowi1,0.0
6,8.84689e+17,8.84689e+17,GerBangNKRI,GerBangNKRI,0.0
7,1.28463e+18,1.28463e+18,Ikhaamina,Ikhaamina,0.0
9,1.33371e+18,1.33371e+18,Majidan7,Majidan7,0.0


In [58]:
zero_self.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 529 entries, 0 to 2351
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   source       529 non-null    float64
 1   target       529 non-null    float64
 2   source_name  529 non-null    object 
 3   target_name  529 non-null    object 
 4   self         529 non-null    float64
dtypes: float64(3), object(2)
memory usage: 41.0+ KB


In [56]:
# dataframe without self-replying

network_noself = network.loc[network['self'] != 0]

In [57]:
network_noself.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 851 entries, 1 to 2358
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   source       851 non-null    float64
 1   target       851 non-null    float64
 2   source_name  851 non-null    object 
 3   target_name  851 non-null    object 
 4   self         851 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.9+ KB


In [59]:
network_noself.head()

Unnamed: 0,source,target,source_name,target_name,self
1,1.58299e+18,377233100.0,alparezi11,AbdinegaraSetia,1.58299e+18
10,1548141000.0,8.87744e+17,esasuryo,OposisiCerdas,-8.87744e+17
11,1548141000.0,1.46665e+18,esasuryo,papa_loren,-1.46665e+18
15,1.08231e+18,1.20211e+18,MamJr4,Leonita_Lestari,-1.198e+17
16,1.08231e+18,419572000.0,MamJr4,RyaWiedy,1.08231e+18


In [67]:
# establish nodes

nodes = pd.DataFrame(data=network_noself['target'])
nodes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 851 entries, 1 to 2358
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   target  851 non-null    float64
dtypes: float64(1)
memory usage: 13.3 KB


In [68]:
nodes['id']=nodes['target']
nodes['label']=network_noself['target_name']
nodes.drop_duplicates(inplace=True)
nodes.drop(columns=['target'],inplace=True)
print(nodes.info())
nodes.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 387 entries, 1 to 2357
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      387 non-null    float64
 1   label   387 non-null    object 
dtypes: float64(1), object(1)
memory usage: 9.1+ KB
None


Unnamed: 0,id,label
1,377233100.0,AbdinegaraSetia
10,8.87744e+17,OposisiCerdas
11,1.46665e+18,papa_loren
15,1.20211e+18,Leonita_Lestari
16,419572000.0,RyaWiedy
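The nodes table above is built from reply targets only. A sketch of an alternative, assuming that users who appear only as a source should also get a labelled row; the dataframe and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical edge list with the same four columns as the edges dataframe.
edges_demo = pd.DataFrame({
    "source": [1, 1, 2],
    "target": [3, 4, 3],
    "source_name": ["a", "a", "b"],
    "target_name": ["c", "d", "c"],
})

# Stack (id, label) pairs from both ends of every edge.
sources = edges_demo[["source", "source_name"]].rename(
    columns={"source": "id", "source_name": "label"}
)
targets = edges_demo[["target", "target_name"]].rename(
    columns={"target": "id", "target_name": "label"}
)

# One row per unique user ID.
nodes_demo = (
    pd.concat([sources, targets], ignore_index=True)
    .drop_duplicates(subset="id")
    .reset_index(drop=True)
)
```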


In [69]:
# export nodes to csv for network visualization on gephi

nodes.to_csv('../data/nodes_gep.csv',index=False)

In [63]:
# establish data for edges

edges = pd.DataFrame(data=network_noself['source'])
edges.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 851 entries, 1 to 2358
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   source  851 non-null    float64
dtypes: float64(1)
memory usage: 13.3 KB


In [64]:
edges['target']=network_noself['target']
edges['source_name']=network_noself['source_name']
edges['target_name']=network_noself['target_name']
edges.head()

Unnamed: 0,source,target,source_name,target_name
1,1.58299e+18,377233100.0,alparezi11,AbdinegaraSetia
10,1548141000.0,8.87744e+17,esasuryo,OposisiCerdas
11,1548141000.0,1.46665e+18,esasuryo,papa_loren
15,1.08231e+18,1.20211e+18,MamJr4,Leonita_Lestari
16,1.08231e+18,419572000.0,MamJr4,RyaWiedy


In [65]:
print(edges['source_name'].nunique())
edges['target_name'].nunique()

594


387

In [66]:
# export edges to csv for network visualization on gephi

edges.to_csv('../data/edges_gep.csv',index=False)

#### Part 2: Data prep for Tableau visualization
This part prepares data for the topic classification slide on [Tableau](https://public.tableau.com/app/profile/m.alexander8473/viz/capitalrelocationtwitteranalysis/presentation?publish=yes)<br>

The data used here are the results of topic classification using IndoBert GPT2, done in [notebook 4](https://colab.research.google.com/drive/1-YByOO9JaoM5d9Feyd_vfaIQF4kJbu9M#scrollTo=LRNJPxMre_J1)

In [3]:
# import topic classification results from notebook 4

topic=pd.read_csv('../data/topic_class_oct.csv')

In [4]:
topic.head()

Unnamed: 0,politics,technologies
0,0.5163,0.5133
1,0.6435,0.6426
2,0.6308,0.5974
3,0.5758,0.5365
4,0.6103,0.5723


In [6]:
# get the score difference between politics and technologies

topic['delta']= topic['politics'] - topic['technologies']
topic.head()

Unnamed: 0,politics,technologies,delta
0,0.5163,0.5133,0.003
1,0.6435,0.6426,0.0009
2,0.6308,0.5974,0.0334
3,0.5758,0.5365,0.0393
4,0.6103,0.5723,0.038


In [8]:
# import tweets and add into topic df as a new tweet column

tweets = pd.read_csv('../data/labeled_tweets.csv')

In [11]:
tweets= tweets.head(100)
topic['tweets']=tweets['tweets']
topic = topic[['tweets','politics','technologies','delta']]
topic.head()

Unnamed: 0,tweets,politics,technologies,delta
0,sensasi berada di ibu kota nusantara gimana ya...,0.5163,0.5133,0.003
1,dengan metaverse bernama jagat nusantara anda ...,0.6435,0.6426,0.0009
2,metaversa memberikan sensasi berada di ibu kot...,0.6308,0.5974,0.0334
3,meski pembangunan ikn baru tahap awal melalui ...,0.5758,0.5365,0.0393
4,meskipun pembangunan ikn baru tahap awalnamun ...,0.6103,0.5723,0.038


In [12]:
topic.rename(columns={'technologies':'technology'},inplace=True)

In [13]:
topic.head()

Unnamed: 0,tweets,politics,technology,delta
0,sensasi berada di ibu kota nusantara gimana ya...,0.5163,0.5133,0.003
1,dengan metaverse bernama jagat nusantara anda ...,0.6435,0.6426,0.0009
2,metaversa memberikan sensasi berada di ibu kot...,0.6308,0.5974,0.0334
3,meski pembangunan ikn baru tahap awal melalui ...,0.5758,0.5365,0.0393
4,meskipun pembangunan ikn baru tahap awalnamun ...,0.6103,0.5723,0.038


The cells below set thresholds to classify the scores as follows:<br>
- a tweet whose politics score is lower than 0.5 (50%) is classified as neutral, which means neither politics nor technology
- because politics always scores higher than technology, another condition was introduced:<br> 
    if the politics score exceeds the technology score by more than 0.05 (5 percentage points), the tweet is classified as politics,
    but if the difference is 0.05 or less, the tweet is classified as technology & politics.
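The threshold rules above can also be expressed in vectorized form. This is a minimal sketch on invented scores, not the notebook's own cells; `THRESHOLD`, `MARGIN`, `adjusted` and `gap` are illustrative names. It follows the same logic as the temporary columns A and B used below, so a tweet whose technology score exceeds its politics score also ends up as neutral:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; only the first row resembles the real data.
scores = pd.DataFrame({
    "politics":   [0.5163, 0.62, 0.45, 0.55],
    "technology": [0.5133, 0.50, 0.40, 0.60],
})

THRESHOLD = 0.5   # minimum politics score to count as on-topic
MARGIN = 0.05     # gap above which a tweet is politics only

# Politics scores below the threshold are zeroed (column A in the notebook).
adjusted = scores["politics"].where(scores["politics"] >= THRESHOLD, 0)

# Gap between adjusted politics and technology (column B in the notebook).
gap = adjusted - scores["technology"]

# Conditions are checked in order; anything left over gets the default label.
scores["topic"] = np.select(
    [gap < 0, gap > MARGIN],
    ["neutral", "politics"],
    default="technology & politics",
)
```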

In [15]:
# set temporary variable topic_a

topic_a=[]
for i in topic['politics']:
    if i < 0.5:
        topic_a.append(0)
    else:
        topic_a.append(i)

In [17]:
# add a temporary column A
topic['A']=topic_a

In [20]:
# add a temporary column B
topic['B']=topic['A']-topic['technology']
topic.head()

Unnamed: 0,tweets,politics,technology,delta,A,B
0,sensasi berada di ibu kota nusantara gimana ya...,0.5163,0.5133,0.003,0.5163,0.003
1,dengan metaverse bernama jagat nusantara anda ...,0.6435,0.6426,0.0009,0.6435,0.0009
2,metaversa memberikan sensasi berada di ibu kot...,0.6308,0.5974,0.0334,0.6308,0.0334
3,meski pembangunan ikn baru tahap awal melalui ...,0.5758,0.5365,0.0393,0.5758,0.0393
4,meskipun pembangunan ikn baru tahap awalnamun ...,0.6103,0.5723,0.038,0.6103,0.038


In [21]:
# now classify scores in column B based on the following conditions:

classy=[]
for b in topic['B']:
    if b < 0:
        classy.append('neutral')
    elif b > 0.05:
        classy.append('politics')
    else:
        classy.append('technology & politics')

In [23]:
# prepare final dataframe for Tableau

topic['topic']=classy
topic=topic[['tweets','politics','technology','topic']]
topic.head()

Unnamed: 0,tweets,politics,technology,topic
0,sensasi berada di ibu kota nusantara gimana ya...,0.5163,0.5133,technology & politics
1,dengan metaverse bernama jagat nusantara anda ...,0.6435,0.6426,technology & politics
2,metaversa memberikan sensasi berada di ibu kot...,0.6308,0.5974,technology & politics
3,meski pembangunan ikn baru tahap awal melalui ...,0.5758,0.5365,technology & politics
4,meskipun pembangunan ikn baru tahap awalnamun ...,0.6103,0.5723,technology & politics


In [25]:
# export for Tableau plotting and presentation:

topic.to_csv('../data/topic.csv',index=False)