### Pulling from Twitter

This is the notebook used to rehydrate the 2020 Voter Fraud Twitter dataset. Github size limits prevent me from  pushing the whole dataset to the repo. If you'd like to recreate the process download the dataset and point the program to wherever you've got the dataset stored locally. It may be necessary to pip install tweepy as well. Additionally, you'll need to create and use keys and tokens to interface with the twitter API. 

https://developer.twitter.com

#### This notebook will not run without setting up Twitter keys the dataset in the second cell.
#### This notebook will not run without a functional path to the right CSV in the third cell. 

### The final version of the data that I pulled from this notebook is saved in the Data folder of this repo as 'hydrated_tweets' 

You should be able to scroll down and run the last few cells of this notebook to see the end result.

The citation for this data is here:

A. Abilov, Y. Hua, H. Matatov, O. Amir and M. Naaman. (2021). VoterFraud2020: a Multi-modal Dataset of Election Fraud Claims on Twitter. To Appear, International Conference on Web and Social Media (ICWSM 2021).

In [5]:
import time
import random
import tweepy
import pandas as pd

The cell below is very important. First, the csv used here is too large to put in my repo, but can be found on the git repo for the VoterFraud2020 project. 

Then, and this is important, we cut the dataframe down to just community 2. This is the community that the VoterFraud2020 folks labeled as the most problematic. This community was associated with QAnon and had the highest rate of suspensions, so we're focusing the model on that group.

The VoterFraud2020 project also tracked whether or not users were still active. I limited the data to active users because I can't access tweets from banned or suspended users. 

In [15]:
# Getting the tweets for November 2020
nov_tweet_df = pd.read_csv('Data/tweets-2020-11.csv')
# Limiting to community 2 (promoting misinformation)
nov_tweet_df = nov_tweet_df[nov_tweet_df.user_community == 2]
# Reducing to active status (as of VF2020's reckoning.)
nov_tweet_df = nov_tweet_df[nov_tweet_df.user_active_status == 'active']
nov_tweet_df_full = nov_tweet_df.copy()

In [6]:
# This creates an instance of the tweepy client. 
# You'll need to drop your own keys and tokens in here to run on your own.

client = tweepy.Client(bearer_token=' ', 
                       consumer_key=' ', 
                       consumer_secret=' ', 
                       access_token=' ', 
                       access_token_secret=' ', 
                       wait_on_rate_limit=False)



In [7]:
def gettweettext(tweet_id_lists):
    # This function should take a list of tweet ids, get the text from Twitter
    # and save the resulting text as a tuple of tweet id, tweet text.
    # Make sure to create a tweepy Client instance called 'client' to use.
    output_list = []
    for i in tweet_id_lists:
        tweet = client.get_tweet(i)
        tweet_id = i
        tweet_tuple = (tweet_id, str(tweet.data)) 
        output_list.append(tweet_tuple)
    return output_list


In [9]:
# This should run if you've got a working tweepy client. The output should be:
# "[(1333561186168807426, "I watched the Hearing in Pennsylvania and cried.\n\nThen I watched 
# Arizona Hearing and wept!\n\nI'm a 74 year old man in Panama for 10 years and can't stand what I see. 
# I want retribution.\n\n@RudyGiuliani God Bless you for what you are doing to protect us all from this
# election Fraud.")]


# Getting the tweets for November 2020"
gettweettext([1333561186168807426])

[(1333561186168807426,
  "I watched the Hearing in Pennsylvania and cried.\n\nThen I watched Arizona Hearing and wept!\n\nI'm a 74 year old man in Panama for 10 years and can't stand what I see. I want retribution.\n\n@RudyGiuliani God Bless you for what you are doing to protect us all from this election Fraud.")]

In [16]:
nov_tweet_df_full

Unnamed: 0,tweet_id,user_community,user_active_status,retweet_count_metadata,quote_count_metadata,retweet_count_by_community_0,quote_count_by_community_0,retweet_count_by_community_1,quote_count_by_community_1,retweet_count_by_community_2,quote_count_by_community_2,retweet_count_by_community_3,quote_count_by_community_3,retweet_count_by_community_4,quote_count_by_community_4,retweet_count_by_suspended_users,quote_count_by_suspended_users
0,1322689823212212227,2.0,active,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1322689872243601409,2.0,active,0,0,0,0,0,0,0,0,0,0,0,0,0,0
13,1322689940799574017,2.0,active,0,0,0,0,0,0,0,0,0,0,0,0,0,0
21,1322690008478851078,2.0,active,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22,1322690018368987136,2.0,active,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5067159,1333561186168807426,2.0,active,3,0,0,0,0,0,1,0,0,0,0,0,0,0
5067160,1333561267760586752,2.0,active,3,0,0,0,0,0,1,0,0,0,0,0,1,0
5067162,1333561357313003520,2.0,active,1,0,0,0,0,0,1,0,0,0,0,0,0,0
5067164,1333561400749338624,2.0,active,1,0,0,0,0,0,1,0,0,0,0,0,0,0


In [29]:
def gettweettest(list):
       
    output_list = []
    for i in list:
            tweet = client.get_tweet(i)
            tweet_id = i
            tweet_tuple = (tweet_id, str(tweet.data)) 
            test_list.append(tweet_tuple)
    return 'Done'


In [33]:
# Note! Make sure to load your saved DF before running this function!

def theonefunction(source_df, save_df, hours):

    # This function takes a source dataframe of tweet ids
    # And a dataframe into which you are trying to add the tweet id's and associated tweet texts 
    # This version of the function saves that dataframe as a csv file 'hydrated_tweets' 
   
    iterations = hours*4
    while iterations > 0:
        alreadygotitlist = save_df.index.tolist()
        working_df = source_df[~source_df.tweet_id.isin(alreadygotitlist)] 
        sample = working_df.tweet_id.sample(n = 300, replace = False, random_state= 14)
        id_list = sample.to_list()
        hydro_df = save_df
        for i in id_list:
            tweet = test.get_tweet(i)
            tid = i
            tweet_tuple = (tid, str(tweet.data)) 
            hydro_df = hydro_df.append({'tweet_id': tid,'tweet_text': str(tweet.data)}, ignore_index=True)
            hydro_df.to_csv('hydrated_tweets')
            
        iterations -= 1
                
            
        
    else:
        return "Done!"
        


In [34]:
hydration_station = pd.read_csv('hydrated_tweets')

In [35]:
hydration_station

Unnamed: 0.1,Unnamed: 0,tweet_id,tweet_text
0,0,1329817517590786054,I'm not saying I believe there was widespread ...
1,1,1330873630515896321,Yes of course! https://t.co/YqpjvH0NoS
2,2,1333262884357419010,@RepPaulMitchell @realDonaldTrump Are you real...
3,3,1331653544579981314,Unity! https://t.co/I625kseVMn
4,4,1329835459200114688,Listen to this https://t.co/uhKFLA3GVz
...,...,...,...
51610,51610,1331580447495696385,OMG https://t.co/QFx2mxyRmj
51611,51611,1324945867833552896,
51612,51612,1332903191575490562,
51613,51613,1325128552426057734,#election2020 #election2020results #electionre...


In [36]:
theonefunction(nov_tweet_df_full, hydration_station, .25)

hydration_station = pd.read_csv('hydrated_tweets')

l = ['tweet_id', 'tweet_text']

hydration_station = hydration_station[l]

hydration_station

Unnamed: 0,tweet_id,tweet_text
0,1329817517590786054,I'm not saying I believe there was widespread ...
1,1330873630515896321,Yes of course! https://t.co/YqpjvH0NoS
2,1333262884357419010,@RepPaulMitchell @realDonaldTrump Are you real...
3,1331653544579981314,Unity! https://t.co/I625kseVMn
4,1329835459200114688,Listen to this https://t.co/uhKFLA3GVz
...,...,...
51910,1331580447495696385,OMG https://t.co/QFx2mxyRmj
51911,1324945867833552896,
51912,1332903191575490562,
51913,1325128552426057734,#election2020 #election2020results #electionre...


In [38]:
def run_function_15_mins(hours):

# This function runs the function above every 15 minutes for however many hours you ask it to run. 
# I usually ran it over night, and made it a very talkative function so I could trouble shoot.
 
    
    i = hours*4
    while i > 0:
        time.sleep(60*15.1)
        hydration_station = pd.read_csv('hydrated_tweets')
        l = ['tweet_id', 'tweet_text']
        hydration_station = hydration_station[l]
        theonefunction(nov_tweet_df_full, hydration_station, .25)
        print("I think I have looped" )
        print(len(hydration_station))
        i -= 1
        print(i)
    print('Done.')
    return "Done!"

In [39]:
run_function_15_mins(13)

I think I have looped
51915
51
I think I have looped
52215
50
I think I have looped
52515
49
I think I have looped
52815
48
I think I have looped
53115
47
I think I have looped
53415
46
I think I have looped
53715
45
I think I have looped
54015
44
I think I have looped
54315
43
I think I have looped
54615
42
I think I have looped
54915
41
I think I have looped
55215
40
I think I have looped
55515
39
I think I have looped
55815
38
I think I have looped
56115
37
I think I have looped
56415
36
I think I have looped
56715
35
I think I have looped
57015
34
I think I have looped
57315
33
I think I have looped
57615
32
I think I have looped
57915
31
I think I have looped
58215
30
I think I have looped
58515
29
I think I have looped
58815
28
I think I have looped
59115
27
I think I have looped
59415
26
I think I have looped
59715
25
I think I have looped
60015
24
I think I have looped
60315
23
I think I have looped
60615
22
I think I have looped
60915
21
I think I have looped
61215
20
I think 

'Done!'

In [40]:
hydration_station = pd.read_csv('hydrated_tweets')


In [41]:
hydration_station

Unnamed: 0.1,Unnamed: 0,tweet_id,tweet_text
0,0,1329817517590786054,I'm not saying I believe there was widespread ...
1,1,1330873630515896321,Yes of course! https://t.co/YqpjvH0NoS
2,2,1333262884357419010,@RepPaulMitchell @realDonaldTrump Are you real...
3,3,1331653544579981314,Unity! https://t.co/I625kseVMn
4,4,1329835459200114688,Listen to this https://t.co/uhKFLA3GVz
...,...,...,...
67510,67510,1331580447495696385,OMG https://t.co/QFx2mxyRmj
67511,67511,1324945867833552896,
67512,67512,1332903191575490562,
67513,67513,1325128552426057734,#election2020 #election2020results #electionre...
