# Political Troll Tweets Analysis: Data Sampling
---

**<u>_Objective:_</u>** In this project, we perform exploratory data analysis on Russian, Chinese and Indonesian information operations, to uncover the trolls' tradecraft and modus operandi against a target populace.

This notebook samples the raw datasets to produce the working dataset used in the rest of the project.


#### Data Collection
Run only this part for data collection and aggregation.

In [1]:
# import modules and dependencies
import numpy as np
import pandas as pd
import os
import shutil as sh

from data_collection_utility import *

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
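
The `data_collection_utility` module is not included in this notebook, so the behaviour of `dataset_fusion` can only be inferred from how it is called and from the log lines it prints. The sketch below is a hypothetical stand-in under that assumption: `dataset_fusion_sketch`, the toy file names and the toy rows are all invented for illustration, and the real helper may well do more (or something different).

```python
import os
import tempfile

import pandas as pd


def dataset_fusion_sketch(folder, filenames):
    """Hypothetical stand-in for dataset_fusion: read each CSV and stack them."""
    frames = []
    for name in filenames:
        df = pd.read_csv(os.path.join(folder, name))
        print(f'Length of {name} dataframe is {len(df)}')
        frames.append(df)

    merged = pd.concat(frames, ignore_index=True)
    # sanity check: the merged frame should hold every input row
    rows_match = len(merged) == sum(len(df) for df in frames)
    print(f'Length of merged dataframe is {len(merged)}, [{rows_match}]')
    return merged


# toy demonstration with two small, made-up CSV files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'tweetid': [1, 2], 'tweet_language': ['en', 'ru']}).to_csv(
        os.path.join(tmp, 'a.csv'), index=False)
    pd.DataFrame({'tweetid': [3], 'tweet_language': ['en']}).to_csv(
        os.path.join(tmp, 'b.csv'), index=False)
    fused = dataset_fusion_sketch(tmp, ['a.csv', 'b.csv'])
```

Checking the merged length against the sum of the input lengths is a cheap way to catch a file that was skipped or read twice.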

### Russia

Remember to rerun the cell below: it defines `cur_dir` and `path`, which the later cells rely on.

In [2]:
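import os

# current working directory of the notebook
cur_dir = os.getcwd()
# dropping the last 24 characters is assumed to strip the notebook's subfolder,
# leaving the project root (see the printed output)
print(cur_dir[:-24])

# folder holding one subfolder of raw CSV files per state actor
#path = 'H:/My Drive/State-linked Information Ops Analysis/data/'
path = 'C:/Users/jh.quek/Documents/SPOTTED Data Collection/data2/'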

G:\My Drive\State-linked Information Ops Analysis
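
Slicing a fixed number of characters off `cur_dir` only works while the notebook sits in a folder whose path suffix has exactly that length. As a sketch of a less brittle alternative, we can walk up the parent folders until one containing a `data` folder is found. The helper name `find_project_root` and the `marker` parameter are assumptions made for this example; they are not part of the project.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker='data'):
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f'no parent of {start} contains a {marker!r} folder')


# toy demonstration on a throwaway folder tree: project/data and project/notebooks/sampling
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'project'
    (root / 'data').mkdir(parents=True)
    nested = root / 'notebooks' / 'sampling'
    nested.mkdir(parents=True)

    found = find_project_root(nested)
    root_resolved = root.resolve()
    print(found == root_resolved)
```

In the notebook this would be called as `find_project_root(os.getcwd())`, assuming the project root is the first parent that holds a `data` folder.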


In [3]:
state_actors = ['Russia', 'China']

for state_actor in state_actors:
    
    datasets = os.listdir(path + state_actor)
    print('List of datasets for state actor', state_actor, '\n', datasets)
    combined_df = dataset_fusion(path + state_actor + '/',  datasets)

    # keep only English tweets, then draw a random 50% sample
    combined_sample_df = combined_df[combined_df['tweet_language'] == 'en']
    combined_sample_df = combined_sample_df.sample(frac = 0.5, random_state = 46)
    
    # do a random shuffle for the combined dataset
    combined_sample_df = combined_sample_df.sample(frac = 1.0, random_state = 42)
    
    print('Writing to csv file ...')
    combined_sample_df.to_csv(state_actor +  '_Sample.csv')
    print('[*]----------------------------------------------- Complete -----------------------------------------------[*]\n')

List of datasets for state actor Russia 
 ['Russia_GRU_Feb_2021.csv', 'Russia_IRA_Feb_2021.csv', 'Russia_IRA_Oct_2018.csv', 'Russia_Jan_2019.csv', 'Russia_May_2020.csv']
Length of Russia_GRU_Feb_2021.csv dataframe is 26684
Length of Russia_IRA_Feb_2021.csv dataframe is 68914
Length of Russia_IRA_Oct_2018.csv dataframe is 8768633
Length of Russia_Jan_2019.csv dataframe is 920761
Length of Russia_May_2020.csv dataframe is 3434792
Length of merged dataframe is 13219784, [True]
[*]--------------------------------------      SUCCESS      --------------------------------------[*]

Writing to csv file ...
[*]----------------------------------------------- Complete -----------------------------------------------[*]

List of datasets for state actor China 
 ['China_Changyu_Culture_Dec_2021.csv', 'China_May_2020.csv', 'China_S1_Aug_2019.csv', 'China_S2_Aug_2019.csv', 'China_S3_Sept_2019.csv', 'China_Xinjiang_Dec_2021.csv']
Length of China_Changyu_Culture_Dec_2021.csv dataframe is 35924
Length of
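
To make the sampling step easier to follow, here is a minimal sketch of the same filter, sample and shuffle sequence on a toy dataframe. Only the `tweet_language` column name and the two `random_state` values come from the notebook; the `tweetid` values and the language mix are made up.

```python
import pandas as pd

# made-up toy data: 12 rows alternating between English and Russian
toy_df = pd.DataFrame({
    'tweetid': range(12),
    'tweet_language': ['en', 'ru'] * 6,
})

# step 1: keep only English tweets
english_df = toy_df[toy_df['tweet_language'] == 'en']

# step 2: draw half of them; a fixed random_state makes the draw repeatable
sample_a = english_df.sample(frac=0.5, random_state=46)
sample_b = english_df.sample(frac=0.5, random_state=46)

# step 3: shuffle; frac=1.0 keeps every row and only changes the order
shuffled = sample_a.sample(frac=1.0, random_state=42)

print(len(english_df), len(sample_a), len(shuffled))
print(sample_a.index.equals(sample_b.index))
```

Note that `sample` already returns its rows in random order, so the extra `frac = 1.0` shuffle changes the ordering but not which rows are kept.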

### China

For China, we only pick two folders:
- China Changyu Culture
- China Xinjiang

In [3]:

# build the full path of every file in the China Changyu Culture folder
cc_dir = cur_dir[:-24] + '\\data\\China Changyu Culture'

full_path = [os.path.join(cc_dir, file) for file in os.listdir(cc_dir)]
full_path

['G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2012.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2013.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2014.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2015.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2016.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2017.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2018.csv',
 'G:\\My Drive\\State-linked Information Ops Analysis\\data\\China Changyu Culture\\CNCC_0621_tweets_csv_hashed_2019.csv',
 'G:\\My Drive\\

In [28]:
os.listdir(cur_dir[:-24] + '\\data\\China Changyu Culture')
#China_Changyu_Culture_df = [pd.read_csv(file) for file in os.listdir(cur_dir[:-24] + '\\data\\China Changyu Culture')]

['CNCC_0621_tweets_csv_hashed_2012.csv',
 'CNCC_0621_tweets_csv_hashed_2013.csv',
 'CNCC_0621_tweets_csv_hashed_2014.csv',
 'CNCC_0621_tweets_csv_hashed_2015.csv',
 'CNCC_0621_tweets_csv_hashed_2016.csv',
 'CNCC_0621_tweets_csv_hashed_2017.csv',
 'CNCC_0621_tweets_csv_hashed_2018.csv',
 'CNCC_0621_tweets_csv_hashed_2019.csv',
 'CNCC_0621_tweets_csv_hashed_2020.csv',
 'CNCC_0621_tweets_csv_hashed_2021.csv']

In [8]:
China_CC_path = cur_dir[:-24] + '\\data\\China Changyu Culture'
China_XJ_path = cur_dir[:-24] + '\\data\\China Xinjiang'

China_CC_full_path = [os.path.join(China_CC_path, file) for file in os.listdir(China_CC_path)]
China_XJ_full_path = [os.path.join(China_XJ_path, file) for file in os.listdir(China_XJ_path)]

China_Changyu_Culture_df = [pd.read_csv(file) for file in China_CC_full_path]
China_XinJiang_df = [pd.read_csv(file) for file in China_XJ_full_path]


China_df = China_Changyu_Culture_df + China_XinJiang_df

# concatenate the dataframe
combined_df = pd.concat(China_df)


# keep only English tweets (the 50% subsampling step below is disabled for this subset)
combined_sample_df = combined_df[combined_df['tweet_language'] == 'en']
#combined_sample_df = combined_sample_df.sample(frac = 0.5, random_state = 46)

# do a random shuffle for the combined dataset
combined_sample_df = combined_sample_df.sample(frac = 1.0, random_state = 42)

print('Writing to csv file ...')
combined_sample_df.to_csv('China_CCXJ' +  '_Sample.csv')
print('[*]----------------------------------------------- Complete -----------------------------------------------[*]\n')

Writing to csv file ...
[*]----------------------------------------------- Complete -----------------------------------------------[*]
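
One detail worth checking when concatenating many per-year files: each file read by `pd.read_csv` gets its own index starting at 0, and `pd.concat` keeps those labels unless told otherwise. Because `to_csv` writes the index by default, the repeated labels end up in the saved file. The sketch below shows the difference on two made-up frames standing in for two yearly CSV files.

```python
import pandas as pd

# two made-up frames standing in for two yearly CSV files
df_2012 = pd.DataFrame({'tweetid': [1, 2], 'tweet_language': ['en', 'zh']})
df_2013 = pd.DataFrame({'tweetid': [3, 4], 'tweet_language': ['en', 'en']})

# default behaviour: the original row labels are kept, so they repeat
default_concat = pd.concat([df_2012, df_2013])

# ignore_index=True renumbers the rows from 0 to n-1
reindexed_concat = pd.concat([df_2012, df_2013], ignore_index=True)

print(default_concat.index.tolist())    # [0, 1, 0, 1]
print(reindexed_concat.index.tolist())  # [0, 1, 2, 3]
```

Whether the repeated labels matter depends on how the sample file is read back later; passing `ignore_index=True` to `pd.concat`, or `index=False` to `to_csv`, are two ways to avoid carrying them along.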

