## Twitter data set
This notebook merges three parts of the Twitter15 data set [Liu et al.\*] into one table for a BERT-based disinformation classification tool:
1. Tweet features: `#followers`, `user_engagement`, `verified` (verified profile), `length`, `#hashs`, `#mentions`, `#URLs`, `sentiment_score`;
2. Tweet content (source tweets);
3. Veracity label (disinformation or no disinformation).

The sentiment score is based on the VADER sentiment analysis tool\*\*.

\*Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah, in *Proceedings of the 24th ACM International on Conference on Information and Knowledge Management* (2015) [[link to dataset]](https://www.dropbox.com/s/7ewzdrbelpmrnxu/rumdetect2017.zip?dl=0&file_subpath=%2Frumor_detection_acl2017)

\*\* https://github.com/cjhutto/vaderSentiment

### Load libraries

In [1]:
import pandas as pd

### 1. Load tweet features
Features for 741 root tweets.

In [3]:
# user data
path_feat = './src/df_final.csv'
df_feat = pd.read_csv(path_feat, index_col=0)

# keep only root (source) tweets, i.e. rows at depth 0
df_root = df_feat[df_feat['depth']==0]

# keep the author and tweet feature columns, plus the ID columns
df_root = df_root[['#followers', 'user_engagement', 'verified', 'depth','user_id1', 'tweet_id1',
                   'user_id2', 'length', '#hashs', '#mentions', '#URLs', 'sentiment_score']]

df_root.tail()

Unnamed: 0,#followers,user_engagement,verified,depth,user_id1,tweet_id1,user_id2,length,#hashs,#mentions,#URLs,sentiment_score
228138,26440,41.599219,1,0.0,ROOT,523939598691741696,58524428,102,0,1,2,0.5411
228237,229578,18.716981,0,0.0,ROOT,534205316620382208,292777349,82,0,0,2,0.0
228368,1124,8.984467,0,0.0,ROOT,523558773345255424,311660957,140,1,0,0,-0.5106
228483,1845314,11.056478,1,0.0,ROOT,559863517151784960,167421802,82,0,0,0,0.4404
228888,17401942,61.400081,1,0.0,ROOT,518870005677826050,3108351,99,0,0,2,0.0


In [6]:
# size of data frame
df_root.shape

(741, 12)
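The root-tweet filter can be checked on a small example. The sketch below uses a hypothetical toy table (`cascade`, with made-up IDs and follower counts) and assumes that `tweet_id1` identifies the source tweet of each cascade, so that exactly one depth-0 row per `tweet_id1` is expected.

```python
import pandas as pd

# Hypothetical toy propagation table: one row per tweet in a cascade.
cascade = pd.DataFrame({
    "tweet_id1": [11, 11, 12, 12, 12],
    "depth": [0.0, 1.0, 0.0, 1.0, 2.0],
    "#followers": [500, 20, 9000, 35, 7],
})

# Keep only the root tweets (depth 0), as in the cell above.
roots = cascade[cascade["depth"] == 0]

# Assumption: each source tweet should appear exactly once after filtering.
assert roots["tweet_id1"].is_unique

print(roots.shape)
```

If the uniqueness check fails on the real data, the later merges on `tweet_id1` would duplicate rows, so it is worth running before merging.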

### 2. Load tweet content

In [7]:
# source tweets
path_tweet = './src/source_tweets.txt'
df_tweet = pd.read_csv(path_tweet, sep="\t", header=None)  # tab-separated: tweet ID, text
df_tweet.columns = ['tweet_id1','text']
df_tweet.head()

Unnamed: 0,tweet_id1,text
0,731166399389962242,🔥ca kkk grand wizard 🔥 endorses @hillaryclinto...
1,714598641827246081,an open letter to trump voters from his top st...
2,691809004356501505,america is a nation of second chances —@potus ...
3,693204708933160960,"brandon marshall visits and offers advice, sup..."
4,551099691702956032,rip elly may clampett: so sad to learn #beverl...


In [8]:
# size of data frame
df_tweet.shape

(1490, 2)

The 1,490 source tweets include 748 tweets labelled `unverified` or `non-rumor` rather than `true` or `false`.

### 3. Load veracity label

In [9]:
# veracity (label)
path_label = './src/label.txt'
df_label = pd.read_csv(path_label, sep=":", header=None)
df_label.columns = ['label','tweet_id1']
df_label.head()

Unnamed: 0,label,tweet_id1
0,unverified,731166399389962242
1,unverified,714598641827246081
2,non-rumor,691809004356501505
3,non-rumor,693204708933160960
4,true,551099691702956032


In [10]:
df_label['label'].value_counts()

unverified    374
non-rumor     374
true          372
false         370
Name: label, dtype: int64
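As a minimal sketch of how the `label:tweet_id` format parses and how the non-binary labels can be counted, the example below reads four lines from an in-memory string instead of `./src/label.txt`. The rows are taken from the outputs shown in this notebook; the variable names (`raw`, `is_binary`, `n_other`) are illustrative only.

```python
import io
import pandas as pd

# A few label lines in the same "label:tweet_id" format as label.txt.
raw = io.StringIO(
    "unverified:731166399389962242\n"
    "non-rumor:691809004356501505\n"
    "true:551099691702956032\n"
    "false:489800427152879616\n"
)

labels = pd.read_csv(raw, sep=":", header=None, names=["label", "tweet_id1"])

# Count every label, then count those that are neither 'true' nor 'false'.
counts = labels["label"].value_counts()
is_binary = labels["label"].isin(["true", "false"])
n_other = int((~is_binary).sum())

print(counts)
print(n_other)
```

On the full label file, `n_other` corresponds to the `unverified` and `non-rumor` counts above (374 + 374 = 748).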

### Merge tweet features (1) with source tweets (2)

In [11]:
# inner join (pandas default): keeps only tweets present in both frames
df1 = pd.merge(df_root, df_tweet, on=['tweet_id1'])
df1.head()

Unnamed: 0,#followers,user_engagement,verified,depth,user_id1,tweet_id1,user_id2,length,#hashs,#mentions,#URLs,sentiment_score,text
0,15375121,72.567469,1,0.0,ROOT,489800427152879616,2467791,95,0,0,2,-0.3182,malaysia airlines says it lost contact with pl...
1,3673898,55.294333,1,0.0,ROOT,560474897013415936,59553554,118,0,1,1,0.8398,for just $1 you can get a free jr. frosty with...
2,1274260,32.033388,1,0.0,ROOT,524928119955013632,19038934,133,1,0,0,-0.7269,police say they have located car belonging to ...
3,13955752,64.548896,1,0.0,ROOT,518830518792892416,51241574,96,0,0,1,-0.34,mexico security forces hunting 43 missing stud...
4,189683,24.726166,1,0.0,ROOT,551117430345711616,2280470022,96,0,0,2,0.0,news saudi arabia's national airline planning ...
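An inner merge silently drops rows without a match, so it can help to see which keys fail to match. The sketch below is one way to do that, using the `validate` and `indicator` parameters of `pd.merge` on hypothetical toy frames (`features`, `tweets`) that stand in for `df_root` and `df_tweet`.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_root and df_tweet.
features = pd.DataFrame({
    "tweet_id1": [101, 102, 103],
    "length": [95, 118, 133],
})
tweets = pd.DataFrame({
    "tweet_id1": [101, 102, 104],
    "text": ["first tweet", "second tweet", "unmatched tweet"],
})

# validate raises a MergeError if tweet_id1 is duplicated on either side;
# indicator adds a _merge column saying where each row came from.
checked = pd.merge(features, tweets, on="tweet_id1", how="left",
                   validate="one_to_one", indicator=True)

# Feature rows with no matching source tweet.
unmatched = checked[checked["_merge"] == "left_only"]

# Equivalent to the inner merge, after inspecting what was dropped.
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(unmatched["tweet_id1"].tolist())
print(merged.shape)
```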


### Merge tweet features (1) and source tweets (2) with veracity label (3)

In [12]:
df_final = pd.merge(df1, df_label, on=['tweet_id1'])

# encode the veracity label: 'false' -> 1 (disinformation), 'true' -> 0 (no disinformation)
di = {'false':1, 'true':0}
df_final['label'] = df_final['label'].map(di)

df_final.head()

Unnamed: 0,#followers,user_engagement,verified,depth,user_id1,tweet_id1,user_id2,length,#hashs,#mentions,#URLs,sentiment_score,text,label
0,15375121,72.567469,1,0.0,ROOT,489800427152879616,2467791,95,0,0,2,-0.3182,malaysia airlines says it lost contact with pl...,1
1,3673898,55.294333,1,0.0,ROOT,560474897013415936,59553554,118,0,1,1,0.8398,for just $1 you can get a free jr. frosty with...,1
2,1274260,32.033388,1,0.0,ROOT,524928119955013632,19038934,133,1,0,0,-0.7269,police say they have located car belonging to ...,0
3,13955752,64.548896,1,0.0,ROOT,518830518792892416,51241574,96,0,0,1,-0.34,mexico security forces hunting 43 missing stud...,1
4,189683,24.726166,1,0.0,ROOT,551117430345711616,2280470022,96,0,0,2,0.0,news saudi arabia's national airline planning ...,1
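`Series.map` returns `NaN` for any value that is not a key of the dictionary, so `unverified` or `non-rumor` rows would end up with a missing label if they were present after the merge. A minimal sketch of a defensive version, on a hypothetical toy frame (`labels`), filters to the mapped labels first and then confirms that no label is missing:

```python
import pandas as pd

# Hypothetical toy frame containing all four Twitter15 label values.
labels = pd.DataFrame({
    "tweet_id1": [1, 2, 3, 4],
    "label": ["false", "true", "unverified", "non-rumor"],
})

mapping = {"false": 1, "true": 0}

# Keep only rows whose label is a key of the mapping, then encode.
binary = labels[labels["label"].isin(list(mapping))].copy()
binary["label"] = binary["label"].map(mapping).astype("int64")

# No label should be missing after the encoding.
assert binary["label"].notna().all()

print(binary)
```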


### Export data 

In [16]:
df_final.to_csv('./twitter_full.csv', index=False)
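A quick round trip can confirm that the exported file reloads with the same shape and intact 18-digit tweet IDs. The sketch below writes a hypothetical two-row sample (`sample`, with placeholder text) to a temporary directory rather than touching `./twitter_full.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the tweet IDs are taken from the output above.
sample = pd.DataFrame({
    "tweet_id1": [489800427152879616, 560474897013415936],
    "text": ["first sample text", "second sample text"],
    "label": [1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "twitter_sample.csv")
    sample.to_csv(path, index=False)
    # Read the IDs back explicitly as 64-bit integers.
    reloaded = pd.read_csv(path, dtype={"tweet_id1": "int64"})

print(reloaded.shape)
```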