# **Project SPOTTED: Data Selection, Collection and Sampling**

**<u>_Objective:_</u>** We fine-tune a pretrained BERT model to predict whether a tweet was made by an information operative (state-sponsored troll) or not. The aim is to increase the efficiency of 


from state trolls for the defense and intelligence community.

---
In this notebook, I describe the approach to selecting, collecting and sampling the data that will be used in Project SPOTTED.

Data forms the bedrock of any machine learning model. During the collection phase, we must maintain data integrity, so that the data stays accurate and consistent over its life cycle. If the data is biased or unrepresentative from the outset, all our efforts will be in vain once the biased data is fed into our models. Before we begin, we recognize the limitations of dealing with very large datasets in Pandas. When unzipped, the size of the entire troll dataset archive is **118 GB**. Furthermore, Pandas is prone to crashing when it reads a CSV file larger than 3 GB. To work around this, we use random sampling to obtain a sample that is as representative as possible of the entire population.
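
One way to stay within memory limits is to read a CSV in chunks and keep a random fraction of each chunk. The sketch below illustrates the idea; the helper name `sample_large_csv`, the chunk size and the fraction are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd

def sample_large_csv(path, frac=0.1, chunksize=100_000, random_state=0):
    """Read a large CSV chunk by chunk and keep a random fraction of each chunk.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    sampled_chunks = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # use a different seed per chunk so chunks are not sampled identically
        sampled_chunks.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(sampled_chunks, ignore_index=True)
```

Sampling the same fraction from every chunk keeps the sample spread across the whole file, rather than concentrated in the rows that happen to be read first.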

We aim to produce a dataset of **200 000 tweets** for Project SPOTTED. First, I will explain how the data sources were selected, which is the first step of the collection process.

### Data Selection

The procedure to produce the 200 000 tweets can be summarized as follows:
1. Choose 20 troll datasets from the Twitter Moderation Research Consortium (see below) and randomly sample 10 000 tweets from each where possible. Pool the samples, remove duplicates and randomly sample 100 000 tweets to form the troll dataset.
2. Choose 50 verified Twitter accounts and perform stratified random sampling:
   - Categorize the 50 accounts into 5 strata
   - For each account within each stratum, collect 2500 to 3500 tweets where possible
   - Randomly sample tweets within each stratum according to the stratum's significance (given in Table 2)
   - Pool all of the random samples together (110 000 tweets in total, per Table 2), remove duplicates and randomly sample 100 000 tweets to form the verified Twitter dataset. Randomly shuffle the dataset.
3. Concatenate these two large datasets. Randomly shuffle the dataset.
4. Take the first 150 000 rows as the **Training Dataset**. This is the dataset that enters the train-test split function for training the ML model.
5. Take the last 5000 rows as the **Validation Dataset**. This dataset is never seen by the model during training; the model is asked to predict whether each of its tweets was made by a troll or not.
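
Steps 3 to 5 can be sketched on toy data as follows. The function name `shuffle_and_split` and the toy dataframes are assumptions for illustration; the actual cells further down operate on the full troll and verified dataframes.

```python
import pandas as pd

def shuffle_and_split(troll_df, verified_df, n_train, n_val, random_state=8):
    """Concatenate, shuffle, then take the head as training and the tail as validation."""
    combined = pd.concat([troll_df, verified_df], ignore_index=True)
    shuffled = combined.sample(frac=1.0, random_state=random_state).reset_index(drop=True)
    if n_val <= 0 or n_train + n_val > len(shuffled):
        raise ValueError('n_train + n_val must fit within the dataset, with n_val > 0')
    train_df = shuffled.iloc[:n_train].reset_index(drop=True)
    val_df = shuffled.iloc[-n_val:].reset_index(drop=True)
    return train_df, val_df

# toy stand-ins for the troll (label 1) and verified (label 0) datasets
troll_toy = pd.DataFrame({'tweet_text': ['troll %d' % i for i in range(10)], 'target': 1})
verified_toy = pd.DataFrame({'tweet_text': ['clean %d' % i for i in range(10)], 'target': 0})

train_toy, val_toy = shuffle_and_split(troll_toy, verified_toy, n_train=15, n_val=3)
print(len(train_toy), len(val_toy))
```

Because the head and the tail of the shuffled dataframe cannot overlap as long as `n_train + n_val` does not exceed the number of rows, no row ends up in both splits. With the project's numbers (150 000 and 5000 out of 200 000 rows), the 45 000 rows in between belong to neither split.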

The overview of the entire collection process can be summarized in the following chart: 


<p align="center">  
  <img src="https://github.com/QuekJingHao/google-data-analytics-capstone-project/blob/main/4 Images/db_overall.png" width="800" height="600">
</p>


In selecting our data, we aim to achieve the following objectives:
1. Sample from the entire repository between 2018 and 2021 
   - The reason is simple: state actors change and modify their MO and tradecraft over time. For example, Russian trolls may target the 2016 elections by spreading disinformation. In 2021, however, they may attack the botched American withdrawal from Afghanistan by amplifying the number of American casualties.
2. Select multiple verified Twitter accounts
   - Again, the accounts should be as representative as possible and cover a wide range of topics.
3. Ensure there is no data leakage of validation data into the training dataset
   - We should remove any duplicated tweets if present. We should also randomly shuffle the dataset before we split it into training and validation datasets. In this way, we ensure that none of the tweets used for validation also appears in the training dataset. If that happened, the measured performance of the model would look much better than it really is.
   <br></br>
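
The leakage objective can be checked mechanically once a split exists. The sketch below counts validation tweets whose text also occurs in the training set; the helper name `count_leaked_tweets` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def count_leaked_tweets(train_df, val_df, text_col='tweet_text'):
    """Return how many validation tweets also appear in the training set."""
    train_texts = set(train_df[text_col])
    return int(val_df[text_col].isin(train_texts).sum())

# toy data with two duplicated tweets ('a' and 'b')
tweets = pd.DataFrame({'tweet_text': ['a', 'b', 'b', 'c', 'd', 'a']})

# remove duplicates first, then shuffle, then split
deduped = tweets.drop_duplicates(subset='tweet_text').reset_index(drop=True)
shuffled = deduped.sample(frac=1.0, random_state=0).reset_index(drop=True)
train_df, val_df = shuffled.iloc[:3], shuffled.iloc[3:]

leaked = count_leaked_tweets(train_df, val_df)
print('Leaked tweets:', leaked)
```

Splitting the raw `tweets` dataframe without de-duplicating would leave the second `'a'` in the validation part while the first sits in the training part, which is exactly the situation this check is meant to catch.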

In the next two subsections, I'll describe the selection process for the troll and clean datasets. 
<br></br>



#### <u>Selection of State-linked Troll Datasets</u>

The information ops dataset can be downloaded from the Twitter Moderation Research Consortium (TMRC) via the following link: https://transparency.twitter.com/en/reports/moderation-research.html. The TMRC oversees moderation and transparency on Twitter. Such measures include the disclosure of state-linked operatives engaging in platform manipulation and information operations. With the formation of the Consortium, newer datasets are no longer public but are shared only with members of the Consortium. Nonetheless, the goldmine of an archive released from 2018 to 2021 is more than sufficient for our purpose. 

The following table lists all of the 20 selected troll datasets:

<center>

|                                        |                                                |
|:--------------------------------------:|:----------------------------------------------:|
| 1.  IRA Oct 2018                       | 11.  Indonesia Feb 2020                        |
| 2.  Iran Oct 2018                      | 12.  China May 2020                            |
| 3.  Russia Jan 2019                    | 13.  Russia May 2020 $\dagger$                 |
| 4.  Venezuela Jan 2019 $\dagger$       | 14.  Thailand Sept 2020                        |
| 5.  Iran Jan 2019 $\dagger$            | 15.  Iran Feb 2021                             |
| 6.  Iran (Set 1) June 2019             | 16.  Russia GRU Feb 2021                       |
| 7.  Iran (Set 2) June 2019             | 17.  Russia IRA Feb 2021                       |
| 8.  China (Set 1) Aug 2019             | 18.  China Changyu Culture Dec 2021 $\dagger$  |
| 9.  China (Set 2) Aug 2019             | 19.  China Xinjiang Dec 2021 $\dagger$         |
| 10. China (Set 3) Sept 2019 $\dagger$  | 20.  Venezuela Dec 2021 $\dagger$              |

</center>

$\dagger$ contains multiple CSVs to be merged


We will pick only the English tweets, drop any duplicates and mine 10 000 tweets from each of these datasets.
<br></br>



#### <u>Selection of Verified Twitter Users</u>

In selecting the accounts that make up the clean dataset, we have to pay close attention to the troll dataset. An exploratory data analysis (in another notebook) on the troll datasets reveals some key aspects of the trolls' MO:
1. The earliest information operations date back to 2009
2. The majority of the trolls target US government, society and politics
3. The majority disguise their tweets as genuine "news", or amplify events detrimental to US interests and national security
4. Many pose as genuine persons using Twitter, while parroting extreme and divisive political / social views and amplifying existing grievances

Hence, in selecting the verified accounts, we aim to mirror the above-mentioned dynamic: we collect from many verified news media accounts, and include a fraction of US government accounts. However, we should avoid restricting the clean dataset to the US or international news alone, so I have included several accounts related to Singapore. Lastly, some accounts related to entertainment, science and technology are included for good measure.  

The table below shows the verified Twitter accounts, categorized into 5 strata:

<center>

|             |                                      |                             |                                  |                               |                          |
|:-----------:|:-------------------------------------|:----------------------------|:---------------------------------|:------------------------------|:--------------------------|
|**[Strata]**   | US Politics                          | US Military                 | Singapore Government             | Entertainment, Science & Tech | International News        |
|**[Accounts]** | President of the United States       | Department of Defense       | PAP                              | Guns n Roses                  | CNN                       |
|             | Vice President of the United States  | US Cyber Command            | WP                               | Metallica                     | BBC World                 |
|             | The White House                      | Defense Intelligence Agency | Ministry of Defense              | NASA                          | New York Times            |
|             | Hillary Clinton                      | National Security Agency    | Singapore Police Force           | Google                        | TIME                      |
|             | Barack Obama                         | Central Intelligence Agency | Republic of Singapore Air Force  | SpaceX                        | Washington Post           |
|             |                                      | US Army                     | Gov Singapore                    |                               | The Straits Times         |
|             |                                      | US Air Force                | Ministry of Home Affairs         |                               | Channel News Asia         |
|             |                                      | US Navy                     | Ministry of Education            |                               | TODAY Online              |
|             |                                      | US Marine Corps             | Ministry of Foreign Affairs      |                               | Wall Street Journal       |
|             |                                      | Indo-Pacific Command        | Ministry of Health               |                               | Reuters                   |
|             |                                      |                             |                                  |                               | The Economist             |
|             |                                      |                             |                                  |                               | Financial Times           |
|             |                                      |                             |                                  |                               | Bloomberg                 |
|             |                                      |                             |                                  |                               | Forbes                    |
|             |                                      |                             |                                  |                               | CNBC                      |
|             |                                      |                             |                                  |                               | MSNBC                     |
|             |                                      |                             |                                  |                               | CBS News                  |
|             |                                      |                             |                                  |                               | ABC                       |
|             |                                      |                             |                                  |                               | CNN International         |
|             |                                      |                             |                                  |                               | New York Times World      |

</center>

The tweets are collected using the OSINT package Twitter Intelligence Tool (TWINT), which bypasses the need for the Twitter API.

The number of tweets we will sample from each of these strata is given in the following table:

<center>

| Strata                          | Number of tweets to randomly sample    |
|:-------------------------------:|:--------------------------------------:|
| US Politics                     | 10 000                                 |
| US Military                     | 10 000                                 |
| Singapore Government            | 10 000                                 |
| Entertainment, Science & Tech   | 10 000                                 |
| International News              | 70 000                                 |

Table 2: Number of tweets to randomly sample within each stratum
    
</center>
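
The allocation in Table 2 can be expressed as a small stratified sampling routine. This is a minimal sketch: the helper name `stratified_sample`, the added `stratum` column and the capping of the sample size when a stratum is too small are assumptions for illustration, not the behaviour of the cells further down.

```python
import pandas as pd

# number of tweets to draw per stratum, as listed in Table 2
STRATUM_SIZES = {
    'US_Politics': 10000,
    'US_Military': 10000,
    'SG_Government': 10000,
    'EST': 10000,
    'International_News': 70000,
}

def stratified_sample(strata, sizes, random_state=0):
    """Draw sizes[name] rows from each stratum dataframe and pool the results.

    strata: dict mapping stratum name -> dataframe of that stratum's tweets
    sizes:  dict mapping stratum name -> number of rows to draw
    """
    samples = []
    for i, (name, n) in enumerate(sizes.items()):
        df = strata[name]
        n_take = min(n, len(df))  # assumption: cap at the rows actually available
        sample = df.sample(n=n_take, random_state=random_state + i)
        samples.append(sample.assign(stratum=name))
    return pd.concat(samples, ignore_index=True)
```

Weighting International News at 70 000 mirrors the observation that most trolls disguise their tweets as news, so the clean dataset is dominated by genuine news tweets.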

The stratified random sampling process can be summarized in the following chart:

<p align="center">  
  <img src="https://github.com/QuekJingHao/google-data-analytics-capstone-project/blob/main/4 Images/db_overall.png" width="800" height="600">
</p>

Lastly, we encode the troll and verified datasets as follows:

<center>

|            |      |  
|:----------:|:----:|
| Troll:     | 1    |
| Verified:  | 0    |

</center>


In [None]:
# import modules and dependencies
import numpy as np
import pandas as pd
import twint
import os
import random
import shutil as sh
import nest_asyncio
import snscrape.modules.twitter as sntwitter

from Data_Collection_Utility import *

pd.set_option('display.max_colwidth', None)

### Sampling Troll Dataset

We first need to combine the datasets marked with a dagger. After that, we sample 10 000 tweets from each of the datasets and combine the samples.
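
The merging itself is done with `dataset_fusion`, which is imported from `Data_Collection_Utility` and whose source is not shown in this notebook. A hypothetical stand-in that conveys the idea (read every part file in a folder and concatenate them row-wise) could look like this; the name `fuse_csv_parts` is an assumption.

```python
import os
import pandas as pd

def fuse_csv_parts(folder, filenames):
    """Concatenate several CSV files from one folder into a single dataframe.

    Hypothetical equivalent of the project's dataset_fusion helper.
    """
    frames = [pd.read_csv(os.path.join(folder, name), low_memory=False)
              for name in filenames]
    return pd.concat(frames, ignore_index=True)
```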

In [None]:
# different paths for running the notebook on thinkpad
troll_path = 'E:/SPOTTED Data Collection/Data/Troll/'
#troll_path = 'C:/Users/jh.quek/Documents/SPOTTED Data Collection/Data/Troll/'

In [None]:
multiple_datasets = ['Venezuela_Jan_2019', 'Iran_Jan_2019', 'China_S3_Sept_2019', 'Russia_May_2020', 
                     'China_Changyu_Culture_Dec_2021', 'China_Xinjiang_Dec_2021', 'Venezuela_Dec_2021']

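for dataset in multiple_datasets:
    print('Combining', dataset)
    combined_name = dataset + '.csv'

    # merge only the part files; skip the merged csv written by a previous run,
    # otherwise its rows would be merged in a second time
    list_of_datasets = [name for name in os.listdir(troll_path + dataset)
                        if name.endswith('.csv') and name != combined_name]
    
    print(list_of_datasets)
    
    combined_df = dataset_fusion(troll_path + dataset + '/', list_of_datasets)
    
    # everything is executed on the hard drive, so save the merged dataframe next to its parts
    combined_df.to_csv(troll_path + dataset + '/' + combined_name)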

Combining Venezuela_Jan_2019
['Venezuela_Jan_2019_P1.csv', 'Venezuela_Jan_2019_P2.csv', 'Venezuela_Jan_2019_P3.csv', 'Venezuela_Jan_2019.csv']
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Combining Iran_Jan_2019
['Iran_Jan_2019_P1.csv', 'Iran_Jan_2019_P2.csv', 'Iran_Jan_2019_P3.csv', 'Iran_Jan_2019_P4.csv', 'Iran_Jan_2019.csv']
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Combining China_S3_Sept_2019
['China_S3_Aug_2019_P1.csv', 'China_S3_Aug_2019_P2.csv', 'China_S3_Aug_2019_P3.csv', 'China_S3_Sept_2019.csv']
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Combining Russia_May_2020
['Russia_May_2020_P1.csv', 'Russia_May_2020_P2.csv', 'Russia_May_2020.csv']
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Combining China_Changyu_Culture_Dec_2021
['China_Changyu_Cult

Now we are ready to sample 10 000 tweets from each of the 20 datasets and save the samples into a folder.

In [None]:
# The huge datasets need special handling when executing on a machine with limited memory.
# Keep directories only, so that files written into troll_path (e.g. the combined csv) are skipped.
troll_datasets = [directory for directory in os.listdir(troll_path)
                  if os.path.isdir(troll_path + directory) and directory != 'Troll_Samples']
big_datasets = ['China_S3_Sept_2019', 'Iran_Jan_2019', 'Russia_IRA_Oct_2018', 
                'Russia_May_2020', 'Venezuela_Jan_2019']

# NOTE: the sampling loop below is kept inside a string, so it does not run when this cell is executed
"""
# Sampling 10000 rows from each of the 20 troll datasets - with a random state of 5 for reproducibility
nrows = 10000
for i, dataset in enumerate(troll_datasets):
    csv_path = troll_path + '{}/{}.csv'.format(dataset, dataset)
    
    # for the very big datasets, we make pandas randomly read about 80% of the rows
    if dataset in big_datasets:
        p = 0.8
        np.random.seed(3)
        df = pd.read_csv(csv_path, 
                         skiprows = lambda x : x > 0 and np.random.rand(1)[0] > p,
                         low_memory = False)
    else:
        df = pd.read_csv(csv_path, low_memory = False)

    # we pick only the English tweets
    df_sample = df[df['tweet_language'] == 'en']
    
    # cap the sample size: a dataset may contain fewer English tweets than nrows
    df_sample = df_sample.sample(n = min(nrows, len(df_sample)), random_state = 5)
    
    df_sample.to_csv(troll_path + 'Troll_Samples/{}_Sample.csv'.format(dataset))
    
    print('[*]-- Point [{}] --------- Sampling {} Complete --------- Length of dataframe: [{}] [*]'.format(i + 1, dataset, len(df_sample)))
"""
    
# lastly, concatenate the 20 samples into one sample and write to csv
troll_sample_datasets = [name + '_Sample.csv' for name in troll_datasets]
troll_combined_df = dataset_fusion(troll_path + 'Troll_Samples/', troll_sample_datasets)
print('Length of combined sampled troll dataset is', len(troll_combined_df))

# randomly shuffle the dataframe with random state of 10
troll_combined_df  = troll_combined_df.sample(frac = 1.0, random_state = 10)
troll_combined_df.to_csv(troll_path + 'Troll_Dataset_Combined.csv')

Length of China_Changyu_Culture_Dec_2021_Sample.csv dataframe is 4777
Length of China_May_2020_Sample.csv dataframe is 10000
Length of China_S1_Aug_2019_Sample.csv dataframe is 10000
Length of China_S2_Aug_2019_Sample.csv dataframe is 10000
Length of China_S3_Sept_2019_Sample.csv dataframe is 10000
Length of China_Xinjiang_Dec_2021_Sample.csv dataframe is 10000
Length of Indonesia_Feb_2020_Sample.csv dataframe is 10000
Length of Iran_Feb_2021_Sample.csv dataframe is 10000
Length of Iran_Jan_2019_Sample.csv dataframe is 10000
Length of Iran_Oct_2018_Sample.csv dataframe is 10000
Length of Iran_S1_June_2019_Sample.csv dataframe is 10000
Length of Iran_S2_June_2019_Sample.csv dataframe is 10000
Length of Russia_GRU_Feb_2021_Sample.csv dataframe is 10000
Length of Russia_IRA_Feb_2021_Sample.csv dataframe is 10000
Length of Russia_IRA_Oct_2018_Sample.csv dataframe is 10000
Length of Russia_Jan_2019_Sample.csv dataframe is 10000
Length of Russia_May_2020_Sample.csv dataframe is 10000
Length 

### Downloading Clean Dataset

Parameter specification: we set the limit on the number of tweets to be extracted from each of the 50 accounts.

We also subject the scraper to the following constraint:
* The latest tweet collected should be dated 1 January 2022.
  - This is because the latest troll tweet is from around that time. We have to match the period of the clean data with that of the troll dataset.
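
If the cutoff has to be enforced after collection, it can be applied with a simple date filter. The sketch below assumes that the scraped dataframe has a `date` column holding uniformly formatted date strings; the helper name `keep_until` and the column name are assumptions and may not match the scraper's actual output.

```python
import pandas as pd

def keep_until(df, cutoff='2022-01-01', date_col='date'):
    """Keep only the rows dated on or before the cutoff day."""
    dates = pd.to_datetime(df[date_col], errors='coerce')
    # strictly before the following day, so the cutoff day itself is included;
    # rows whose date cannot be parsed become NaT and are dropped
    limit = pd.Timestamp(cutoff) + pd.Timedelta(days=1)
    return df[dates < limit].reset_index(drop=True)

# toy example
toy = pd.DataFrame({'date': ['2021-12-31', '2022-01-01', '2022-01-02'],
                    'tweet': ['kept', 'kept', 'dropped']})
filtered = keep_until(toy)
print(filtered)
```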

In [None]:
verified_path = 'E:/SPOTTED Data Collection/Data/Verified/'

In [None]:
# below are the 5 strata we will be working with. Each element in the lists is the username of an account
US_Politics_Stratum = ['POTUS', 'VP', 'WhiteHouse', 'BarackObama', 'HillaryClinton'] 

US_Military_Stratum = ['DeptofDefense', 'US_CYBERCOM', 'DefenseIntel', 'NSAGov', 'CIA', 
                       'USArmy', 'usairforce',  'USNavy', 'USMC', 'INDOPACOM']

SG_Stratum = ['PAPSingapore', 'wpsg', 'mindefsg', 'SingaporePolice', 'TheRSAF',
              'govsingapore', 'mhasingapore', 'MOEsg', 'MFAsg', 'sporeMOH']

EST_Stratum = ['gunsnroses', 'Metallica', 'NASA', 'SpaceX', 'Google']
    
International_News_Stratum = ['CNN', 'BBCWorld', 'nytimes', 'TIME', 'washingtonpost',
                              'straits_times', 'ChannelNewsAsia', 'TODAYonline', 'WSJ', 'Reuters',
                              'TheEconomist', 'FT', 'business', 'Forbes', 'CNBC',
                              'MSNBC', 'CBSNews', 'ABC', 'cnni', 'nytimesworld']

print('Number of accounts in each of these categories:\n\
      US Politics: {}\n\
      US Military: {}\n\
      SG Government: {}\n\
      EST: {}\n\
      International News: {}\n\
      '.format(len(US_Politics_Stratum), len(US_Military_Stratum), 
               len(SG_Stratum), len(EST_Stratum), len(International_News_Stratum)))


Number of accounts in each of these categories:
      US Politics: 5
      US Military: 10
      SG Government: 10
      EST: 5
      International News: 20
      


Now we will mine the verified accounts above for 2500 tweets each (3500 for the international news accounts).

In [None]:
# mine US politicians category
Twint_Scrapper(2500, US_Politics_Stratum, True, True, True)
move_verified_datasets([file + '.csv' for file in US_Politics_Stratum], 'US_Politics')

Collecting Tweets on POTUS ...
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
Length of dataframe: 2056
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on VP ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on WhiteHouse ...
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
Length of dataframe: 2283
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on BarackObama ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on HillaryClinton ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]


[*] Collection Succe

In [None]:
# mine US military category
Twint_Scrapper(2500, US_Military_Stratum, True, True, True)
move_verified_datasets([file + '.csv' for file in US_Military_Stratum], 'US_Military')

Collecting Tweets on DeptofDefense ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on US_CYBERCOM ...
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
Length of dataframe: 334
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on DefenseIntel ...
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
Length of dataframe: 2376
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on NSAGov ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on CIA ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets o

In [None]:
# mine Singapore accounts
Twint_Scrapper(2500, SG_Stratum, True, True, True)
move_verified_datasets([file + '.csv' for file in SG_Stratum], 'SG_Government')

Collecting Tweets on PAPSingapore ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on wpsg ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on mindefsg ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on SingaporePolice ...
Length of dataframe: 2513
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on TheRSAF ...
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
Length of dataframe: 2181
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on govsingapore ...
Length of dataframe: 2500
[*]-------------------------------

In [None]:
# mine EST accounts
Twint_Scrapper(2500, EST_Stratum, True, True, True)
move_verified_datasets([file + '.csv' for file in EST_Stratum], 'EST')

Collecting Tweets on gunsnroses ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on Metallica ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on NASA ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on SpaceX ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on Google ...
Length of dataframe: 2500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]


[*] Collection Successful. Total number of scrapped users: 5
Moving files to designated folder...
[*]--------------------------------------      COMPLETE      ------------------------------

In [None]:
# mine international news accounts
Twint_Scrapper(3500, International_News_Stratum, True, True, True) # need to mine more for international news
move_verified_datasets([file + '.csv' for file in International_News_Stratum], 'International_News')

Collecting Tweets on CNN ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on BBCWorld ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on nytimes ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on TIME ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on washingtonpost ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on straits_times ...
Length of dataframe: 3500
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Collecting Tweets on 

In [None]:
# trim the SPF dataset to 2500 rows because the scraper returned 2513 rows
verified_path = 'D:/SPOTTED Data Collection/Data/Verified/'
SPF_df = pd.read_csv(verified_path + 'SG_Government/SingaporePolice.csv')
SPF_df = SPF_df.iloc[:2500].reset_index(drop = True)
SPF_df.to_csv(verified_path + 'SG_Government/SingaporePolice.csv')
print(len(SPF_df))

2500


After downloading the required tweets, we combine the datasets of each stratum into a single dataframe. Then, we randomly sample from the 5 resulting dataframes according to Table 2.

In [None]:
verified_path = 'D:/SPOTTED Data Collection/Data/Verified/'
verified_directories = ['US_Politics', 'US_Military', 'SG_Government', 'EST', 'International_News']

# get the files in each of these directories and merge all of them into one file
df_merged_all = []
for directory in verified_directories:
    verified_datasets = os.listdir(verified_path + directory)
    print('List of datasets in', directory, ':\n', verified_datasets, '\n')
    
    merged_df = dataset_fusion(verified_path + '{}/'.format(directory), verified_datasets)
    df_merged_all.append(merged_df)
    
# sample from each of the merged dataframes according to the number of rows in Table 2
verified_df_sampled = []
verified_df_nrows = [10000, 10000, 10000, 10000, 70000]
for i, (merged_df, nrows) in enumerate(zip(df_merged_all, verified_df_nrows)):
    sample_df = merged_df.sample(n = nrows, random_state = i)
    verified_df_sampled.append(sample_df)

# lastly, concatenate them together
verified_df = pd.concat(verified_df_sampled, ignore_index = True)
    
# randomly shuffle the dataframe before writing to csv with random state 15
verified_df = verified_df.sample(frac = 1.0, random_state = 15)
verified_df.to_csv(verified_path + 'Verfied_Dataset.csv')

List of datasets in US_Politics :
 ['POTUS.csv', 'VP.csv', 'WhiteHouse.csv', 'BarackObama.csv', 'HillaryClinton.csv'] 

Length of POTUS.csv dataframe is 2056
Length of VP.csv dataframe is 2500
Length of WhiteHouse.csv dataframe is 2283
Length of BarackObama.csv dataframe is 2500
Length of HillaryClinton.csv dataframe is 2500
Length of merged dataframe is 11839, [True]
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

List of datasets in US_Military :
 ['DeptofDefense.csv', 'US_CYBERCOM.csv', 'DefenseIntel.csv', 'NSAGov.csv', 'CIA.csv', 'USArmy.csv', 'usairforce.csv', 'USNavy.csv', 'USMC.csv', 'INDOPACOM.csv'] 

Length of DeptofDefense.csv dataframe is 2500
Length of US_CYBERCOM.csv dataframe is 334
Length of DefenseIntel.csv dataframe is 2376
Length of NSAGov.csv dataframe is 2500
Length of CIA.csv dataframe is 2500
Length of USArmy.csv dataframe is 2500
Length of usairforce.csv dataframe is 2500
Length of USNavy.csv dataframe is 250

### Dataset Fusion

Now that we have the troll and clean datasets, we are ready to fuse them together, and write the results as a csv file. Here, we will only pick the required columns that we will use later in the project and drop the rest.

In [None]:
troll_all_df = pd.read_csv(troll_path + 'Troll_Dataset_Combined.csv', low_memory = False)

# drop the duplicates
troll_all_df = remove_df_duplicates(troll_all_df)
    
# take a random sample of 100000 rows from the merged troll dataframe
troll_df = troll_all_df.sample(n = 100000, random_state = 8)

troll_df = troll_df[['tweet_text', 'hashtags']].reset_index(drop = True)
troll_df['target'] = np.ones(100000)

Dropping duplicates... Initial length of dataframe: 183920
Removal Completed. Final length of dataframe: 168454


In [None]:
verified_all_df = pd.read_csv(verified_path + 'Verfied_Dataset.csv', low_memory = False)
verified_all_df = verified_all_df[['tweet', 'hashtags']]
verified_all_df = verified_all_df.rename({'tweet' : 'tweet_text'}, axis = 'columns')

# drop the duplicates
verified_all_df = remove_df_duplicates(verified_all_df)

# take a random sample of 100000 rows from the merged verified dataframe
verified_df = verified_all_df.sample(n = 100000, random_state = 67)

verified_df['target'] = np.zeros(100000)
verified_df.head()

Dropping duplicates... Initial length of dataframe: 110000
Removal Completed. Final length of dataframe: 109590


Unnamed: 0,tweet_text,hashtags,target
15369,"Harry Reid remembered for reshaping Obama presidency, Senate and Supreme Court by friends and foes https://t.co/8aKYuUYGew",[],0.0
18113,The Chinese army has been buying hundreds of new helicopters in just a few years' time https://t.co/gUSw0xmJa0,[],0.0
71338,Cats and birds being hoarded in Radin Mas cemetery hut by self-proclaimed spirit healer https://t.co/374ETpRUo7,[],0.0
5880,"#ForbesUnder30 Steve Wen, CEO of Dray Alliance, had the idea to modernize freight logistics while running a business that exported luxury goods. ""In the shipping world, they were still faxing paperwork around,"" he says. ""That made no sense to me."" https://t.co/eBEs1DJtcu",['forbesunder30'],0.0
44775,"Since the start of the #COVID19 pandemic, the DIA workforce has gone above &amp; beyond to continue the mission. It goes without saying, DIA's workforce is simply unmatched! On this #EmployeeAppreciationDay, join DIA Chief of Staff Johnny Sawyer in thanking the men &amp; women of DIA. https://t.co/q1v9Gir9Jl","['covid19', 'employeeappreciationday']",0.0


In [None]:
# merge the two dataframes together!
SPOTTED_df = pd.concat([troll_df, verified_df], ignore_index = True)

SPOTTED_df = SPOTTED_df.sample(frac = 1.0, random_state = 8).reset_index(drop = True)

# now split the dataframe - the first 150000 rows form the training set (named 'test' in the variable and file names below) and the last 5000 rows form the validation set
SPOTTED_test_df = SPOTTED_df[:150000].reset_index(drop = True)
SPOTTED_validation_df = SPOTTED_df[-5000:].reset_index(drop = True)

SPOTTED_test_df.to_csv('SPOTTED_test_dataset.csv')
SPOTTED_validation_df.to_csv('SPOTTED_validation_dataset.csv')
SPOTTED_df.to_csv('SPOTTED_dataset.csv')

In [None]:
print(len(SPOTTED_df))
SPOTTED_df.head(10)

200000


Unnamed: 0,tweet_text,hashtags,target
0,"As of 5 June 2020, 12pm, we have preliminarily confirmed an additional 261 cases of COVID-19 infection in Singapore. https://t.co/2RFMhrRkUw",[],0.0
1,"Boyfriend of missing Florida woman charged with murder: ""We wish Collin would provide us the information of where Kathleen is"" https://t.co/DBDJS5McdW",[],0.0
2,K-pop's BTS snags top prize at American Music Awards https://t.co/eR432aHJlm,[],0.0
3,RT @CincinnatiDays: Man killed in Bond Hill after altercation #news,[news],1.0
4,Jared paying attention to his video game more than me pt 2 @juliakim52 http://t.co/0AHkR3K7Vg,[],1.0
5,"Cat thought lost in Kentucky tornado found 9 days later: ""I thought I heard a meow"" https://t.co/iUGJr3kGDW",[],0.0
6,5 Little Things You Can Do That Have Compounding Effects On Your Savings https://t.co/3pSaMRZROB https://t.co/Z28gZSfVGP,[],0.0
7,RT @mashabletech: #Apple Reportedly Buys PrimeSense for $345 Million http://t.co/HWDdDRWoHI,['Apple'],1.0
8,"“We always say that early diagnosis is the best treatment for cancer. The chance of recovery is higher. Likewise, for COVID-19, if there is a way to prevent it, why not?"" – Teo Khee Huat, 78, colorectal and skin cancer survivor Read more: https://t.co/FR2QKY6N9P #IGotMyShotSG https://t.co/HPqB23CKhG",['igotmyshotsg'],0.0
9,"@aquarius021501 Hi there. Are you getting a specific error message when you try signing into your Google account? Without revealing your email address, give us the exact wording &amp; we'll try and point you in the right direction. This guide may also help: https://t.co/2onqlsMNnL.",[],0.0


In [None]:
print(len(SPOTTED_test_df))
SPOTTED_test_df.tail(5)

150000


Unnamed: 0,tweet_text,hashtags,target
149995,Centuries-old Good Shepherd ring recovered from shipwrecks off Israel https://t.co/R97shKaKAD https://t.co/nO5bbwwxiy,[],0.0
149996,"Here are the 100 best inventions of 2021 making the world better, smarter and a little more fun https://t.co/fpT7v4Ayf9",[],0.0
149997,RT @FarhanKVirk: #ArmyActForDawn Those who fulfill agenda against Pakistan should have no role in mainstream journalism https://t.co/Xsuo1K…,['ArmyActForDawn'],1.0
149998,@Australian_Navy long range frigate HMAS Anzac conducts an underway replenishment with USNS Tippecanoe. #FreeAndOpenIndoPacific https://t.co/6fmP1yBmaA,['freeandopenindopacific'],0.0
149999,"“Although many of us might think we are done with COVID-19, it’s not done with us.” World Health Organization Director-General Tedros Ghebreyesus warned Monday that the Omicron variant is the latest reminder the pandemic remains an ongoing global threat https://t.co/88ZH0B8xYq https://t.co/ahayiAGBp6",[],0.0


In [None]:
print(len(SPOTTED_validation_df))
SPOTTED_validation_df.head(10)

5000


Unnamed: 0,tweet_text,hashtags,target
0,RT @kodiak149: .@PetersonUtah deserves more followers \nFollow @PetersonUtah \nSupport @PetersonUtah \nElections matter \n#wtpBlue https://t.co…,['wtpBlue'],1.0
1,@unpuNISHAble_ @Desh3hunna like you'll avi's,[],1.0
2,"RT @MuslimIQ: Meanwhile millions of parents find it hard to cope that 1 in 6 kids go to bed hungry every night in America, &amp; the poverty th…",[],1.0
3,.@TheRock put his #RedNotice co-stars @GalGadot and @VancityReynolds to the test in our first #MuseumFaceOff! 🎨 Follow @GoogleArts and find out which one of them can tell their Michelangelo from their Van Gogh. https://t.co/UIoTCadGzz @NetflixFilm https://t.co/ksE0Vg8zbR,"['rednotice', 'museumfaceoff']",0.0
4,"Parents who could work from home tried to multitask their way through, often at the cost of their productivity, sanity or both https://t.co/K0fzLY9kVe",[],0.0
5,RT @dagr8fm: New post: Chance the Rapper Finally Becomes a ‘Jeopardy!’ Answer https://t.co/T5GxaANo4C,[],1.0
6,"A merry Christmas to all, and to all a good night! https://t.co/FOcMKZM87x",[],0.0
7,CDC shortens isolation period for asymptomatic people who test positive for Covid. https://t.co/ugv8ibGFNi,[],0.0
8,"Congressman Elijah Cummings fought for the soul of America and will always be remembered as a giant in Congress. On the one year anniversary of his passing, may his legacy continue to shine brightly. https://t.co/0wFn4znfOJ",[],0.0
9,A Bloomberg Businessweek investigation shows popular ratings driving trillions into sustainable investing have little connection with a company's impact on the planet https://t.co/GN4pdoKcm2 via @BW,[],0.0
