## Twitter data set
This notebook merges the Twitter15 and Twitter16\* datasets. The resulting dataframe is the input to a BERT-based disinformation classification tool. It has `1,057` rows (741 from Twitter15 and 316 from Twitter16) with the following columns:
- `label` (1 = false, i.e. disinformation; 0 = true)
- `tweet_id`
- `tweet`.

Tweet features: 
- `length`
- `#URLs`
- `#mentions`
- `#hashs`
- `sentiment_score`\*\*.

User features:
- `verified` (profile)
- `#followers`
- `user_engagement`.

Following Vosoughi et al.\*\*\*, `user_engagement` is defined as (`#tweets` + `#retweets` + `#replies` + `#favourites`) / `account age`. See `./Twitter_API_data_collection.ipynb` for how these values were collected.
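
The formula above can be sketched as a small function. This is a minimal illustration, not the code from the data-collection notebook: the function name and argument names are hypothetical, and `account_age` is assumed to be in the same unit as the `account_age` column of the Twitter16 data shown later in this notebook.

```python
def user_engagement(n_tweets, n_retweets, n_replies, n_favourites, account_age):
    """Activity counts summed and divided by account age (hypothetical helper)."""
    if account_age <= 0:
        raise ValueError("account_age must be positive")
    return (n_tweets + n_retweets + n_replies + n_favourites) / account_age


# Counts taken from the first Twitter16 row displayed later in this notebook:
# #tweets=20582, #retweets=264, #replies=50, #favourites=500, account_age=2216
score = user_engagement(20582, 264, 50, 500, 2216)
print(round(score, 6))  # 9.655235, matching the user_engagement column of that row
```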

\* Liu, X., Nourbakhsh, A., Li, Q., Fang, R., and Shah, S., in *Proceedings of the 24th ACM International on Conference on Information and Knowledge Management* (2015) [[link to dataset]](https://www.dropbox.com/s/7ewzdrbelpmrnxu/rumdetect2017.zip?dl=0&file_subpath=%2Frumor_detection_acl2017)

\*\* Based on the VADER sentiment analysis tool, https://github.com/cjhutto/vaderSentiment

\*\*\* Vosoughi, S., Roy, D., and Aral, S.: The spread of true and false news online. *Science* 359, 6380 (2018), 1146–1151.

### Overview of notebook:
1. Load data Twitter15
   - Merge tweet features (A) with tweet content (B)
   - Merge tweet veracity (C) with merged tweet features & tweet content (A<>B)
   - Data cleaning
2. Load data Twitter16
   - Data cleaning
3. Merge and export final dataframe

### Load libraries

In [1]:
import pandas as pd

### 1. Load data Twitter15
#### A. Tweet features dataframe (Twitter15)

In [2]:
# tweet and user features (one row per tweet in the propagation tree)
path_feat = './src/Twitter15/df_final.csv'
df_feat = pd.read_csv(path_feat, index_col=0)

# keep only root (source) tweets, i.e. rows at depth 0
df_root = df_feat[df_feat['depth']==0]

# keep the author and tweet feature columns needed downstream
df_root = df_root[['#followers', 'user_engagement', 'verified', 'depth','user_id1', 'tweet_id1',
                   'user_id2', 'length', '#hashs', '#mentions', '#URLs', 'sentiment_score']]

df_root.head()

Unnamed: 0,#followers,user_engagement,verified,depth,user_id1,tweet_id1,user_id2,length,#hashs,#mentions,#URLs,sentiment_score
0,15375121,72.567469,1,0.0,ROOT,489800427152879616,2467791,95,0,0,2,-0.3182
519,3673898,55.294333,1,0.0,ROOT,560474897013415936,59553554,118,0,1,1,0.8398
639,1274260,32.033388,1,0.0,ROOT,524928119955013632,19038934,133,1,0,0,-0.7269
964,13955752,64.548896,1,0.0,ROOT,518830518792892416,51241574,96,0,0,1,-0.34
1073,189683,24.726166,1,0.0,ROOT,551117430345711616,2280470022,96,0,0,2,0.0


In [3]:
# size of data frame
df_root.shape

(741, 12)

#### B. Tweet content dataframe

In [4]:
# source tweets (tab-separated: tweet id, tweet text)
path_tweet = './src/Twitter15/source_tweets.txt'
df_tweet = pd.read_csv(path_tweet, sep="\t", header=None)
df_tweet.columns = ['tweet_id1','text']
df_tweet.head()

Unnamed: 0,tweet_id1,text
0,731166399389962242,🔥ca kkk grand wizard 🔥 endorses @hillaryclinto...
1,714598641827246081,an open letter to trump voters from his top st...
2,691809004356501505,america is a nation of second chances —@potus ...
3,693204708933160960,"brandon marshall visits and offers advice, sup..."
4,551099691702956032,rip elly may clampett: so sad to learn #beverl...


In [5]:
# The source file covers all four labels: `true` and `false` (742 tweets in total)
# as well as `unverified` and `non-rumor` (748 tweets in total).
df_tweet.shape

(1490, 2)

#### C. Veracity label dataframe

In [6]:
# veracity (label)
path_label = './src/Twitter15/label.txt'
df_label = pd.read_csv(path_label, sep=":", header=None)
df_label.columns = ['label','tweet_id1']
df_label.head()

Unnamed: 0,label,tweet_id1
0,unverified,731166399389962242
1,unverified,714598641827246081
2,non-rumor,691809004356501505
3,non-rumor,693204708933160960
4,true,551099691702956032


In [7]:
df_label['label'].value_counts()

unverified    374
non-rumor     374
true          372
false         370
Name: label, dtype: int64
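
As a quick arithmetic check, the counts above can be tied back to the 1,490 source tweets. The sketch below only re-adds the numbers printed in this notebook; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
label_counts = {"unverified": 374, "non-rumor": 374, "true": 372, "false": 370}

binary = label_counts["true"] + label_counts["false"]           # labels kept for classification
other = label_counts["unverified"] + label_counts["non-rumor"]  # labels not used as targets
total = sum(label_counts.values())

print(binary, other, total)  # 742 748 1490
```

Note that the root-tweet feature dataframe has 741 rows, one fewer than the 742 `true`/`false` labels; the notebook does not state which tweet is missing.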

### Merge A. tweet features with B. source tweets

In [8]:
df1 = pd.merge(df_root, df_tweet, on=['tweet_id1'])
df1.head()

Unnamed: 0,#followers,user_engagement,verified,depth,user_id1,tweet_id1,user_id2,length,#hashs,#mentions,#URLs,sentiment_score,text
0,15375121,72.567469,1,0.0,ROOT,489800427152879616,2467791,95,0,0,2,-0.3182,malaysia airlines says it lost contact with pl...
1,3673898,55.294333,1,0.0,ROOT,560474897013415936,59553554,118,0,1,1,0.8398,for just $1 you can get a free jr. frosty with...
2,1274260,32.033388,1,0.0,ROOT,524928119955013632,19038934,133,1,0,0,-0.7269,police say they have located car belonging to ...
3,13955752,64.548896,1,0.0,ROOT,518830518792892416,51241574,96,0,0,1,-0.34,mexico security forces hunting 43 missing stud...
4,189683,24.726166,1,0.0,ROOT,551117430345711616,2280470022,96,0,0,2,0.0,news saudi arabia's national airline planning ...
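
`pd.merge` defaults to an inner join, so rows whose `tweet_id1` appears in only one of the two frames are dropped silently. A minimal sketch of how such rows could be inspected with `indicator=True` follows; the toy frames and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for df_root and df_tweet (hypothetical values)
features = pd.DataFrame({"tweet_id1": [1, 2, 3], "length": [95, 118, 133]})
tweets = pd.DataFrame({"tweet_id1": [2, 3, 4], "text": ["b", "c", "d"]})

# Default inner join: only ids present in both frames survive
inner = pd.merge(features, tweets, on="tweet_id1")

# Outer join with an indicator column to see what the inner join dropped
audit = pd.merge(features, tweets, on="tweet_id1", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))                                      # 2
print(unmatched[["tweet_id1", "_merge"]].to_string(index=False))
```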


### Merge C. tweet veracity with merged (A. tweet features <> B. source tweets)

In [9]:
df_final15 = pd.merge(df1, df_label, on=['tweet_id1'])

# map label values: false (disinformation) -> 1, true -> 0
di = {'false':1, 'true':0}
df_final15['label'] = df_final15['label'].map(di) 

# drop columns
df_final15 = df_final15.drop(columns=['user_id1', 'user_id2'])

# rename columns
df_final15.rename(columns={'text':'tweet', 'tweet_id1':'tweet_id'}, inplace=True)

# reorder columns
cols = ['label', 'tweet_id', 'tweet', 'length', '#URLs', '#mentions', '#hashs', 'verified', '#followers', 'user_engagement', 'sentiment_score']
df_final15 = df_final15[cols]

df_final15.head()

Unnamed: 0,label,tweet_id,tweet,length,#URLs,#mentions,#hashs,verified,#followers,user_engagement,sentiment_score
0,1,489800427152879616,malaysia airlines says it lost contact with pl...,95,2,0,0,1,15375121,72.567469,-0.3182
1,1,560474897013415936,for just $1 you can get a free jr. frosty with...,118,1,1,0,1,3673898,55.294333,0.8398
2,0,524928119955013632,police say they have located car belonging to ...,133,0,0,1,1,1274260,32.033388,-0.7269
3,1,518830518792892416,mexico security forces hunting 43 missing stud...,96,1,0,0,1,13955752,64.548896,-0.34
4,1,551117430345711616,news saudi arabia's national airline planning ...,96,2,0,0,1,189683,24.726166,0.0


In [10]:
df_final15.shape

(741, 11)
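
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so `unverified` or `non-rumor` labels would become missing values if they reached this step. The sketch below shows that behaviour on a toy series and one possible way to guard against it; it is an illustration under that assumption, not part of the original pipeline.

```python
import pandas as pd

# Toy label column containing all four label values
labels = pd.Series(["false", "true", "unverified", "non-rumor"])
mapping = {"false": 1, "true": 0}

mapped = labels.map(mapping)        # values missing from the dict become NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)                   # 2

# Keep only the binary labels and restore an integer dtype
binary_labels = mapped.dropna().astype(int)
print(binary_labels.tolist())       # [1, 0]
```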

### 2. Load data Twitter16

In [11]:
# Twitter16 data: labels, tweet content and features in a single file
path_feat = './twitter16_full.csv'
df_twitter16 = pd.read_csv(path_feat)
df_twitter16.head()

Unnamed: 0,label,tweet_id,tweet,length,#URLs,#mentions,#hashs,verified,#followers,#replies,#retweets,#tweets,#favourites,account_age,user_engagement,sentiment_score
0,0,615689290706595840,.@whitehouse in rainbow colors for #scotusmarr...,96,1,1,1,1,2718461,50,264,20582,500,2216,9.655235,0.0
1,1,613404935003217920,cops bought the alleged church shooter burger ...,75,1,0,0,1,11232698,36,88,615466,27,2616,235.327599,-0.6705
2,1,622891631293935616,#wakeupamerica🇺🇸 who needs a #gun registry whe...,96,2,0,3,0,155313,14,112,302421,67,2305,131.2859,-0.34
3,0,553589051044151296,several hostages freed at jewish supermarket i...,83,1,0,1,1,199664,7,138,36560,20,823,44.623329,0.4019
4,0,553590835850514433,"hostage-taker in supermarket siege killed, rep...",77,2,0,1,1,8278077,16,184,563067,81,2731,206.279019,-0.6705


In [12]:
# drop columns
df_final16 = df_twitter16.drop(columns=['#favourites', '#replies', '#retweets', '#tweets', 'account_age'])
df_final16.head()

Unnamed: 0,label,tweet_id,tweet,length,#URLs,#mentions,#hashs,verified,#followers,user_engagement,sentiment_score
0,0,615689290706595840,.@whitehouse in rainbow colors for #scotusmarr...,96,1,1,1,1,2718461,9.655235,0.0
1,1,613404935003217920,cops bought the alleged church shooter burger ...,75,1,0,0,1,11232698,235.327599,-0.6705
2,1,622891631293935616,#wakeupamerica🇺🇸 who needs a #gun registry whe...,96,2,0,3,0,155313,131.2859,-0.34
3,0,553589051044151296,several hostages freed at jewish supermarket i...,83,1,0,1,1,199664,44.623329,0.4019
4,0,553590835850514433,"hostage-taker in supermarket siege killed, rep...",77,2,0,1,1,8278077,206.279019,-0.6705


In [13]:
df_final16.shape

(316, 11)

### 3. Merge and export final dataframe

In [14]:
# stack both datasets; ignore_index avoids duplicate index values in the result
frames = [df_final15, df_final16]
df_final = pd.concat(frames, ignore_index=True)
df_final.head()

Unnamed: 0,label,tweet_id,tweet,length,#URLs,#mentions,#hashs,verified,#followers,user_engagement,sentiment_score
0,1,489800427152879616,malaysia airlines says it lost contact with pl...,95,2,0,0,1,15375121,72.567469,-0.3182
1,1,560474897013415936,for just $1 you can get a free jr. frosty with...,118,1,1,0,1,3673898,55.294333,0.8398
2,0,524928119955013632,police say they have located car belonging to ...,133,0,0,1,1,1274260,32.033388,-0.7269
3,1,518830518792892416,mexico security forces hunting 43 missing stud...,96,1,0,0,1,13955752,64.548896,-0.34
4,1,551117430345711616,news saudi arabia's national airline planning ...,96,2,0,0,1,189683,24.726166,0.0


In [15]:
df_final.shape

(1057, 11)
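
The row count is the sum of the two inputs (741 + 316 = 1057). A few sanity checks of this kind can be run before exporting; the sketch below uses small toy frames with hypothetical values in place of `df_final15` and `df_final16`.

```python
import pandas as pd

# Toy stand-ins for the two cleaned dataframes (hypothetical values)
a = pd.DataFrame({"label": [1, 0], "tweet_id": [10, 11]})
b = pd.DataFrame({"label": [0], "tweet_id": [12]})

# Both frames should share the same columns in the same order
assert list(a.columns) == list(b.columns)

combined = pd.concat([a, b], ignore_index=True)

# No rows lost or added by the concatenation
assert len(combined) == len(a) + len(b)

# Count tweets that appear in both datasets
n_dupes = int(combined["tweet_id"].duplicated().sum())
print(len(combined), n_dupes)  # 3 0
```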

In [16]:
df_final.to_csv('./twitter1516_final.csv', index=False)