# **CT-NASA** (**C**rowd**T**angle-**N**ew **A**ctor **S**earching **A**lgorithm)

This IPython notebook demonstrates the workflow of CT-NASA, which, given a set of URLs supplied by the user, searches the entire CrowdTangle database for new pages or groups.



**Research goal:** Given a list of actors, we want to search for and identify new actors (i.e., groups or pages) that are (or might be) part of the information network through certain behaviors, e.g., link sharing, comments, messages, post descriptions, etc.


In this algorithm, we use PyCrowdTangle, a Python wrapper for the CrowdTangle API. See the following links for more information on that project.

**Pypi:** https://pypi.org/project/PyCrowdTangle/

**Github:** https://github.com/UPB-SS1/PyCrowdTangle

## Install PyCrowdTangle and import libraries

In [None]:
!pip install PyCrowdTangle -q

Import Libraries

In [None]:
import PyCrowdTangle as pct
import pandas as pd

In [None]:
dir(pct)

['PyCrowdTangle',
 '__author__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'ct_get_links',
 'ct_get_lists',
 'ct_get_posts']

In [None]:
# get version
pct.__version__

'0.5.0'

In [None]:
# get the api_token from https://apps.crowdtangle.com/
# you can locate your API token via your crowdtangle dashboard
# under Settings > API Access.
token="XYZZZZZZYYYYYYXXXXXXUUUUUWWWWW" #put your token here
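
Hardcoding the token in a shared notebook risks leaking it. A minimal sketch of one alternative, assuming the token has been stored in an environment variable named `CROWDTANGLE_API_TOKEN` (a hypothetical name chosen for this example; PyCrowdTangle does not require it):

```python
import os

def load_token(env_var="CROWDTANGLE_API_TOKEN"):
    """Return the API token stored in an environment variable (hypothetical name)."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable to your CrowdTangle API token")
    return value

# Usage (uncomment once the variable is set):
# token = load_token()
```

The rest of the notebook only needs `token` to be a string, so either approach works with the cells below.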

## Load CrowdTangle dataset 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
os.chdir("/content/drive/MyDrive/YOUR_DIRECTORY_PATH")

In [None]:
import pandas as pd
import time

In [None]:
# put the data file name below
# on_bad_lines='skip' (pandas >= 1.3) drops malformed rows instead of raising an error
csv_data = pd.read_csv("data_file.csv", low_memory=False, lineterminator='\n', sep=';', on_bad_lines='skip')

In [None]:
csv_data.shape

(315169, 40)
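
To see what skipping malformed rows means in practice, here is a small self-contained sketch on made-up, semicolon-separated data (the column values are invented for illustration). It assumes pandas 1.3 or newer, where `on_bad_lines` is available:

```python
import io
import pandas as pd

# Toy semicolon-separated data; the second data row has an extra field and is malformed.
raw = (
    "account.name;postUrl\n"
    "Page A;https://example.org/1\n"
    "Page B;https://example.org/2;extra\n"
    "Page C;https://example.org/3\n"
)

# Rows with too many fields are dropped instead of raising a parser error.
sample = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(sample.shape)
print(sample["account.name"].tolist())
```

Because rows are dropped silently, it is worth comparing the resulting shape with the expected number of records in the export.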

In [None]:
csv_data.columns

Index(['account.name', 'account.handle', 'platformId', 'Page Category',
       'Page Admin Top Country', 'Page Description', 'Page Created',
       'subscriberCount', 'Followers at Posting', 'date', 'Post Created Date',
       'Post Created Time', 'type', 'totalInteraction',
       'statistics.actual.likeCount', 'statistics.actual.commentCount',
       'statistics.actual.shareCount', 'statistics.actual.loveCount',
       'statistics.actual.wowCount', 'statistics.actual.hahaCount',
       'statistics.actual.sadCount', 'statistics.actual.angryCount',
       'statistics.actual.careCount', 'Video Share Status', 'Is Video Owner?',
       'statistics.actual.videoPostViewCount',
       'statistics.actual.videoTotalViewCount',
       'statistics.actual.videoAllCrosspostsViewCount', 'Video Length',
       'postUrl', 'message', 'expandedLinks.original',
       'expandedLinks.expanded', 'imageText', 'title', 'description',
       'brandedContentSponsor.platformId', 'brandedContentSponsor.name',
 

## Pre-processing of the dataset

Let's identify the unique actors present in the loaded dataset and create a list out of them.

In [None]:
actors_list = csv_data['account.name'].dropna().unique()
print ("Total number of unique actors within the dataset:", actors_list.size)

# print every unique actor name, one per line
for actor in actors_list:
  print (actor)

Let us now identify the top links/URLs present in the dataset

In [None]:
# top links 

csv_data ['expandedLinks.original'].value_counts()

https://www.facebook.com/hanumansinghsirana/videos/1951535514949751/                              28
http://www.akhandbharatimes.com/                                                                  20
https://janganapp.page.link/X42f                                                                  19
https://www.facebook.com/pushpendrakuldelhi001/videos/332266691549294/                            15
https://sachkhabar.co.in/now-biden-wants-modis-help-immediately-only-india-can-save-the-world/    13
                                                                                                  ..
https://www.facebook.com/pradeepBhajpa/videos/434140024317184/                                     1
https://www.facebook.com/photo.php?fbid=1844342252408908&set=p.1844342252408908&type=3             1
https://www.facebook.com/photo.php?fbid=2086761884799509&set=gm.2009189855886754&type=3            1
https://www.facebook.com/203867673485517/photos/a.203869050152046/952309761974634/?type=3  

Select a sub-set of the top URLs

In [None]:
# top N links
N=8 # choose any number for N

URL_list = csv_data ['expandedLinks.original'].dropna().value_counts() [:N].index.tolist()

import numpy as np
for url in URL_list:
  print (url)

https://www.facebook.com/hanumansinghsirana/videos/1951535514949751/
http://www.akhandbharatimes.com/
https://janganapp.page.link/X42f
https://www.facebook.com/pushpendrakuldelhi001/videos/332266691549294/
https://sachkhabar.co.in/now-biden-wants-modis-help-immediately-only-india-can-save-the-world/
https://www.facebook.com/251541358337843
https://sachkhabar.co.in/modi-governments-big-blow-to-zakir-naik/
https://www.facebook.com/462116500605383
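
The top-N selection above can be illustrated on a toy series (the values `u1`, `u2`, `u3` are placeholders, not real links). `value_counts()` sorts by frequency in descending order and ignores missing values by default, so the first N index entries are the N most shared links:

```python
import pandas as pd

# Toy column of shared links, including one missing value.
links = pd.Series(["u1", "u2", "u1", None, "u3", "u1", "u2"])

# Count occurrences (most frequent first) and keep the two most common links.
counts = links.dropna().value_counts()
top_links = counts.head(2).index.tolist()
print(top_links)
```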


## Use *ct_get_links* function to retrieve a set of posts matching a set of URLs

In [None]:
print(pct.ct_get_links.__doc__)

 Retrieve a set of posts matching a certain link.

    Args:
        link (str): The link to query by. Required.
        platforms (str, optional): The platforms from which to retrieve links. This value can be comma-separated.
                                   options: facebook, instagram, reddit. Defaults to 'facebook'.
        count (int, optional): The number of posts to return. Defaults to 100. options [1-100]
        start_date (str, optional): The earliest date at which a post could be posted. Time zone is UTC. 
                                    Format is “yyyy-mm-ddThh:mm:ss” or “yyyy-mm-dd” 
                                    (defaults to time 00:00:00).
        end_date (str, optional):  The latest date at which a post could be posted.
                                  Time zone is UTC. Format is “yyyy-mm-ddThh:mm:ss”
                                  or “yyyy-mm-dd” (defaults to time 00:00:00).
                                  Defaults to "now".
        include_history (

Let's now define a function that retrieves all posts, and with them the posting accounts, associated with a link

In [None]:
def get_all_posts (URL, start_date, api_token):
  data = pct.ct_get_links(link=URL, include_history='true', platforms='facebook', start_date=start_date, api_token=api_token)
  df = pd.DataFrame(data['result']['posts'])
  return df
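
If a link has no matching posts, or the response is missing the expected keys, indexing with `data['result']['posts']` can fail. A defensive sketch, assuming the response is a dictionary shaped like `{'result': {'posts': [...]}}`, which is how the function above indexes it. `posts_to_frame` is a hypothetical helper and not part of PyCrowdTangle:

```python
import pandas as pd

def posts_to_frame(response):
    """Return the posts of a CrowdTangle-style response dict as a DataFrame (empty if none)."""
    result = (response or {}).get("result") or {}
    posts = result.get("posts") or []
    return pd.DataFrame(posts)

# Toy responses standing in for real API output.
print(len(posts_to_frame({"result": {"posts": [{"postUrl": "https://example.org/p/1"}]}})))
print(len(posts_to_frame({"status": 429})))
```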

Function to flatten each post: it keeps selected post fields and merges in the details from the nested account dictionary

In [None]:
def get_dict (df, output_df):
  # post-level fields to keep; the nested 'account' dictionary is merged in as extra columns
  post_keys = ['date', 'updated', 'message', 'link', 'postUrl']
  rows = []
  for i in range (len(df)):
    post_fields = {key: df[key][i] for key in post_keys}
    rows.append ({**post_fields, **df['account'][i]})

  # nothing to add when the query returned no posts
  if not rows:
    return output_df

  # append all flattened posts to the running output in a single concat
  return pd.concat ([output_df, pd.DataFrame(rows)], ignore_index=True)
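
As an aside, pandas can flatten nested dictionaries directly with `pd.json_normalize`. The sketch below uses made-up posts that mimic the `account` nesting; note that, unlike `get_dict`, it prefixes the nested keys (for example `account.name`), so the column names differ from those used in the rest of this notebook:

```python
import pandas as pd

# Toy posts with a nested 'account' dictionary, loosely modelled on the API output.
posts = [
    {"date": "2022-05-07 03:03:01", "postUrl": "https://example.org/p/1",
     "account": {"name": "Group A", "url": "https://example.org/groups/1"}},
    {"date": "2022-04-10 13:37:50", "postUrl": "https://example.org/p/2",
     "account": {"name": "Page B", "url": "https://example.org/pages/2"}},
]

# Nested keys become dotted column names.
flat = pd.json_normalize(posts)
print(flat.columns.tolist())
```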

In [None]:
start_date = '2019-01-01'

output_df = pd.DataFrame()

for i in range (np.size(URL_list)):
  df = get_all_posts (str(URL_list[i]), start_date, token)
  output_df = get_dict (df, output_df)

  if i < (np.size(URL_list)-1):
    time.sleep (31)

print (output_df)

                    date              updated  \
0    2022-05-07 03:03:01  2022-05-08 18:32:54   
1    2022-04-10 13:37:50  2022-05-02 17:23:42   
2    2022-04-10 13:37:49  2022-04-13 14:49:45   
3    2022-04-10 13:37:48  2022-05-03 15:45:48   
4    2022-04-10 13:37:47  2022-05-05 07:29:45   
..                   ...                  ...   
456  2021-11-16 13:51:28  2022-03-07 16:55:56   
457  2021-11-16 13:51:22  2022-04-02 18:54:39   
458  2021-11-16 13:46:20  2022-04-08 01:35:08   
459  2021-11-16 13:40:01  2021-12-13 07:24:16   
460  2021-11-16 13:40:00  2021-12-11 11:56:17   

                                               message  \
0                                                  NaN   
1                                                  NaN   
2                                                  NaN   
3                                                  NaN   
4                                                  NaN   
..                                                 ...   
456  
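
The loop above waits 31 seconds between queries, presumably to stay under the API's rate limit for link queries, and skips the wait after the last URL. The same pattern can be sketched as a reusable helper. `fetch_with_pause` and its parameters are hypothetical names, and the `fetch` argument stands in for a call such as `get_all_posts`:

```python
import time

def fetch_with_pause(urls, fetch, pause_seconds=31, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between calls but not after the last one."""
    results = []
    for position, url in enumerate(urls):
        results.append(fetch(url))
        if position < len(urls) - 1:
            sleep(pause_seconds)
    return results

# Example with a stand-in fetch function and no real waiting:
demo = fetch_with_pause(["u1", "u2"], fetch=str.upper, pause_seconds=0)
print(demo)
```

Passing `sleep` as a parameter makes the pacing easy to test without actually waiting.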

## Post-processing of the dataset

Drop the actors that match the original actors list 

In [None]:
output_new_actors = output_df[~output_df['name'].isin(actors_list)]
print (output_new_actors)

                    date              updated  \
0    2022-05-07 03:03:01  2022-05-08 18:32:54   
1    2022-04-10 13:37:50  2022-05-02 17:23:42   
2    2022-04-10 13:37:49  2022-04-13 14:49:45   
3    2022-04-10 13:37:48  2022-05-03 15:45:48   
4    2022-04-10 13:37:47  2022-05-05 07:29:45   
..                   ...                  ...   
455  2021-11-16 13:51:36  2021-12-17 00:04:43   
456  2021-11-16 13:51:28  2022-03-07 16:55:56   
458  2021-11-16 13:46:20  2022-04-08 01:35:08   
459  2021-11-16 13:40:01  2021-12-13 07:24:16   
460  2021-11-16 13:40:00  2021-12-11 11:56:17   

                                               message  \
0                                                  NaN   
1                                                  NaN   
2                                                  NaN   
3                                                  NaN   
4                                                  NaN   
..                                                 ...   
455  

Let's now drop duplicate rows, keeping the first row for each account URL, so that each new actor appears only once

In [None]:
output_new_actors_unique = output_new_actors.drop_duplicates("url", keep='first', ignore_index=True)
print (output_new_actors_unique)

                    date              updated  \
0    2022-05-07 03:03:01  2022-05-08 18:32:54   
1    2022-04-10 13:37:50  2022-05-02 17:23:42   
2    2022-04-10 13:37:49  2022-04-13 14:49:45   
3    2022-04-10 13:37:48  2022-05-03 15:45:48   
4    2022-04-10 13:37:47  2022-05-05 07:29:45   
..                   ...                  ...   
185  2021-11-16 14:03:38  2021-12-06 21:15:36   
186  2021-11-16 13:58:28  2022-03-31 12:17:41   
187  2021-11-16 13:56:56  2021-12-13 07:24:15   
188  2021-11-16 13:55:54  2021-12-10 11:08:19   
189  2021-11-16 13:54:54  2021-12-11 11:07:28   

                                               message  \
0                                                  NaN   
1                                                  NaN   
2                                                  NaN   
3                                                  NaN   
4                                                  NaN   
..                                                 ...   
185  
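
The two post-processing steps (removing already known actors, then de-duplicating by account URL) can be checked on a toy table. All names and URLs below are invented for illustration:

```python
import pandas as pd

known = ["Page A"]  # actors already present in the original dataset
found = pd.DataFrame({
    "name": ["Page A", "Group B", "Group B", "Page C"],
    "url": [
        "https://example.org/a",
        "https://example.org/b",
        "https://example.org/b",
        "https://example.org/c",
    ],
})

# Step 1: keep only rows whose actor name is not in the known list.
new_only = found[~found["name"].isin(known)]

# Step 2: keep the first row per account URL.
unique_new = new_only.drop_duplicates("url", keep="first", ignore_index=True)
print(unique_new["name"].tolist())
```

Matching on names is exact, so an actor whose name differs slightly between sources (spelling, spacing) would still be treated as new.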

Print the set of unique new actors

In [None]:
print ("Total number of newly found actors is:", len(output_new_actors_unique))
output_new_actors_unique.head()

Total number of newly found actors is: 190


Unnamed: 0,date,updated,message,link,postUrl,id,name,profileImage,subscriberCount,url,platform,platformId,accountType,pageCategory,verified,handle,pageAdminTopCountry,pageDescription,pageCreatedDate
0,2022-05-07 03:03:01,2022-05-08 18:32:54,,https://www.facebook.com/hanumansinghsirana/vi...,https://www.facebook.com/groups/31327128233932...,9062940,The Kapil Sharma Fan Club,https://scontent-sea1-1.xx.fbcdn.net/v/t1.6435...,647586,https://www.facebook.com/groups/313271282339324,Facebook,313271282339324,facebook_group,none,False,,,,
1,2022-04-10 13:37:50,2022-05-02 17:23:42,,https://www.facebook.com/hanumansinghsirana/vi...,https://www.facebook.com/groups/79167731427022...,10171991,BJP HYDERABAD,https://scontent-sea1-1.xx.fbcdn.net/v/t1.6435...,4782,https://www.facebook.com/groups/791677314270224,Facebook,791677314270224,facebook_group,none,False,,,,
2,2022-04-10 13:37:49,2022-04-13 14:49:45,,https://www.facebook.com/hanumansinghsirana/vi...,https://www.facebook.com/groups/25436293726112...,15738239,PVC PIPE MANUFACTURING,https://scontent-sea1-1.xx.fbcdn.net/v/t1.6435...,323011,https://www.facebook.com/groups/2543629372611245,Facebook,2543629372611245,facebook_group,none,False,,,,
3,2022-04-10 13:37:48,2022-05-03 15:45:48,,https://www.facebook.com/hanumansinghsirana/vi...,https://www.facebook.com/groups/19988801035048...,9070218,भारतीय जनता पार्टी विधानसभा क्षेत्र भीनमाल विध...,https://scontent-sea1-1.xx.fbcdn.net/v/t1.6435...,31332,https://www.facebook.com/groups/1998880103504870,Facebook,1998880103504870,facebook_group,none,False,,,,
4,2022-04-10 13:37:47,2022-05-05 07:29:45,,https://www.facebook.com/hanumansinghsirana/vi...,https://www.facebook.com/groups/19441473592441...,5180687,Jagat News,https://scontent-sea1-1.xx.fbcdn.net/v/t1.6435...,20243,https://www.facebook.com/1944147359244100,Facebook,1944147359244100,facebook_group,none,False,,,,


Save the new unique actors *raw* list in a csv file (**optional**)

In [None]:
output_new_actors_unique.to_csv('/content/drive/MyDrive/YOUR_DIRECTORY_PATH/new_actors.csv') 

Let's now split the Facebook pages and groups within the list of newly found actors

In [None]:
output_new_actors_unique_pages = output_new_actors_unique.loc[output_new_actors_unique['accountType']=='facebook_page']
output_new_actors_unique_groups = output_new_actors_unique.loc[output_new_actors_unique['accountType']=='facebook_group']

Let's now get a list of those actors whose names contain any of a set of substrings (e.g., media, news, khabar). Here we are interested in the **media pages and groups**.

In [None]:
# case=False makes the match case-insensitive; na=False treats missing names as non-matches
new_media_pages = output_new_actors_unique_pages.loc[output_new_actors_unique_pages['name'].str.contains("media|news|khabar", case=False, na=False)]
new_media_groups = output_new_actors_unique_groups.loc[output_new_actors_unique_groups['name'].str.contains("media|news|khabar", case=False, na=False)]

In [None]:
print ("Total number of newly found media pages is:", len(new_media_pages))
print ("Total number of newly found media groups is:", len(new_media_groups))

Total number of newly found media pages is: 2
Total number of newly found media groups is: 13


Let's now find out **non-media pages and groups** from the list

In [None]:
# negate the same case-insensitive match; missing names count as non-media
new_general_pages = output_new_actors_unique_pages.loc[~output_new_actors_unique_pages['name'].str.contains("media|news|khabar", case=False, na=False)]
new_general_groups = output_new_actors_unique_groups.loc[~output_new_actors_unique_groups['name'].str.contains("media|news|khabar", case=False, na=False)]

In [None]:
print ("Total number of newly found non-media pages is:", len(new_general_pages))
print ("Total number of newly found non-media groups is:", len(new_general_groups))

Total number of newly found non-media pages is: 30
Total number of newly found non-media groups is: 145
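
The keyword split can be sketched on a few sample names. With `case=False`, one lowercase pattern covers every capitalisation, and `na=False` (an addition in this sketch) makes a missing name count as non-media rather than producing a missing value in the mask:

```python
import pandas as pd

names = pd.Series(["Jagat News", "BJP HYDERABAD", "Sach KHABAR", None, "Social Media Cell"])

# Boolean mask: True where the name contains any keyword, regardless of case.
is_media = names.str.contains("media|news|khabar", case=False, na=False)

print(names[is_media].tolist())   # media-like names
print(names[~is_media].tolist())  # everything else, including the missing name
```

Since every row falls into exactly one of the two masks, the media and non-media counts always add up to the total number of actors.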


## Prepare the different lists of actors (pages and groups) for bulk upload to CrowdTangle

Prepare a csv of new **non-media** groups for bulk upload

In [None]:
new_general_groups.columns

Index(['date', 'updated', 'message', 'link', 'postUrl', 'id', 'name',
       'profileImage', 'subscriberCount', 'url', 'platform', 'platformId',
       'accountType', 'pageCategory', 'verified', 'handle',
       'pageAdminTopCountry', 'pageDescription', 'pageCreatedDate'],
      dtype='object')

In [None]:
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
new_general_groups_csv = new_general_groups[['url']].copy()
new_general_groups_csv = new_general_groups_csv.rename (columns={'url': 'Page or Account URL'})
new_general_groups_csv['List']='New Group Actors'
new_general_groups_csv.head()


Unnamed: 0,Page or Account URL,List
0,https://www.facebook.com/groups/313271282339324,New Group Actors
1,https://www.facebook.com/groups/791677314270224,New Group Actors
2,https://www.facebook.com/groups/2543629372611245,New Group Actors
3,https://www.facebook.com/groups/1998880103504870,New Group Actors
6,https://www.facebook.com/groups/606417793051573,New Group Actors


In [None]:
new_general_groups_csv.to_csv('/content/drive/MyDrive/YOUR_DIRECTORY_PATH/new_gen_groups.csv', index=False) 

Prepare a csv of new **media** groups for bulk upload

In [None]:
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
new_media_groups_csv = new_media_groups[['url']].copy()
new_media_groups_csv = new_media_groups_csv.rename (columns={'url': 'Page or Account URL'})
new_media_groups_csv['List']='New Media Group Actors'
new_media_groups_csv.head()
new_media_groups_csv.to_csv('/content/drive/MyDrive/YOUR_DIRECTORY_PATH/new_media_groups.csv', index=False)

Prepare a csv of new **non-media pages** for bulk upload

In [None]:
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
new_general_pages_csv = new_general_pages[['url']].copy()
new_general_pages_csv = new_general_pages_csv.rename (columns={'url': 'Page or Account URL'})
new_general_pages_csv['List']='New Page Actors'
new_general_pages_csv.head()
new_general_pages_csv.to_csv('/content/drive/MyDrive/YOUR_DIRECTORY_PATH/new_gen_pages.csv', index=False)

Prepare a csv of new **media pages** for bulk upload

In [None]:
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
new_media_pages_csv = new_media_pages[['url']].copy()
new_media_pages_csv = new_media_pages_csv.rename (columns={'url': 'Page or Account URL'})
new_media_pages_csv['List']='New Media Page Actors'
new_media_pages_csv.head()
new_media_pages_csv.to_csv('/content/drive/MyDrive/YOUR_DIRECTORY_PATH/new_media_pages.csv', index=False)
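
The four cells above repeat the same steps with different inputs, so they could be folded into one small helper. A minimal sketch: `make_bulk_upload` is a hypothetical name, and the two-column layout (`Page or Account URL`, `List`) is taken from the cells above rather than from any CrowdTangle specification:

```python
import pandas as pd

def make_bulk_upload(actors, list_name):
    """Return a two-column table (account URL, target list name) for bulk upload."""
    upload = actors[["url"]].copy()  # independent copy, so assignment below is safe
    upload = upload.rename(columns={"url": "Page or Account URL"})
    upload["List"] = list_name
    return upload

# Toy input standing in for one of the actor tables.
toy_groups = pd.DataFrame({
    "name": ["Group A", "Group B"],
    "url": ["https://example.org/groups/1", "https://example.org/groups/2"],
})
toy_upload = make_bulk_upload(toy_groups, "New Group Actors")
print(toy_upload)

# Writing the file would then be a single call per list, for example:
# toy_upload.to_csv("new_gen_groups.csv", index=False)
```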