1.1 Getting data via Twitter APIs
======
**Get Twitter data and save it in files.**

**The following examples show how to use both the Twitter Search API and the Twitter Streaming API to get data. The Search API returns tweets that already exist, while the Streaming API returns tweets as they are posted in real time. "With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. The Twitter request limits have changed over the years but are currently limited to 180 requests in a 15 minute period." Read more in the post: https://brightplanet.com/2013/06/25/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/ .**

**Import tweepy and other required python libraries for getting data from Twitter.**

_The following Python modules must be installed if they are not already in the environment (run all cells to check whether any module is missing). Run the install commands without the leading ! from the command line, or with the leading ! in the notebook. For SWAN, see https://support.aarnet.edu.au/hc/en-us/articles/360000668076-How-do-I-add-code-libraries-to-my-Notebook-_

In [None]:
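# uncomment to run the install, and install any other module reported missing with !pip; the Kernel session may need a shutdown/restart in SWAN
# note: this notebook uses the Tweepy 3.x interface (api.search, StreamListener); Tweepy 4 renamed api.search to api.search_tweets and removed StreamListener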
#!pip install tweepy
#!pip install geopy

In [17]:
# import tweepy and other required python libraries
import tweepy
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

**Get Twitter developer key and secret from https://developer.twitter.com and enter the information below. See also https://www.slickremix.com/docs/how-to-get-api-keys-and-tokens-for-twitter/.**

In [18]:
# set twitter API information
consumer_key = '#######'
consumer_secret = '#######'
access_token = '#######'
access_token_secret = '#######'
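
**Keys typed directly into a notebook are easy to leak when the notebook is shared. As a minimal sketch, the four credentials could instead be read from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are hypothetical choices made for this illustration; neither Twitter nor Tweepy requires them.**

```python
import os

# Hypothetical environment variable names; any naming scheme would work
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_twitter_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise KeyError if any is missing or empty."""
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

**The returned dictionary holds the same four values as `consumer_key`, `consumer_secret`, `access_token` and `access_token_secret` above.**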

**Create Tweepy API object from key and secret.**

In [19]:
# create Tweepy API object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

**Create DataFrame to save twitter data.**

In [20]:
# create DataFrame from twitter data
df = pd.DataFrame(columns = ['text', 'user.name', 'user.statuses_count','user.screen_name',
                             'user.followers_count', 'user.location', 'user.verified','user.profile_image_url_https',
                             'favorite_count', 'retweet_count', 'created_at','hashtags','quoted_hashtags','user_lat','user_lng'])


**Prepare library and test to get gps information (latitude, longitude) from twitter data location.**

In [21]:
# prepare library to get gps information (latitude, longitude) from twitter data location 
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="test",timeout=10)

#loc = geolocator.geocode("New York, NY")
#loc
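
**Many tweets share the same user location, and the Nominatim usage policy allows an absolute maximum of 1 request per second. The sketch below wraps any geocoding callable so that repeated locations are answered from a cache and new lookups are spaced apart. The wrapper `make_polite_geocoder` and its parameter names are hypothetical and written only for this illustration.**

```python
import time

def make_polite_geocoder(geocode, min_delay_seconds=1.0):
    """Wrap a geocode callable: cache results per location and keep
    uncached lookups at least min_delay_seconds apart."""
    cache = {}
    last_call = [None]  # start time of the previous uncached lookup

    def lookup(location):
        if not location:        # skip empty or missing locations
            return None
        if location in cache:   # reuse an earlier result, no request needed
            return cache[location]
        if last_call[0] is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        cache[location] = geocode(location)
        return cache[location]

    return lookup
```

**Assuming the `geolocator` created above, `polite_geocode = make_polite_geocoder(geolocator.geocode)` could be called in place of `geolocator.geocode`.**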

1.1.1 Use Twitter Search API to get data
------

**Set the maximum number of tweets to get. Create the search function to get Twitter data and save it in a CSV file.**

**Example to export as a single csv file for a small dataset.**
Read more at: http://docs.tweepy.org/en/latest/api.html#API.search Note: count – The number of tweets to return per page, currently up to a maximum of 100.
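
**With at most 100 tweets per page and the limit of 180 requests per 15 minute period quoted in the introduction, the number of requests a search needs can be estimated in advance. The helper names below are hypothetical, and the default limits are simply the figures quoted in this notebook, which Twitter may change.**

```python
import math

def requests_needed(tweet_number, tweets_per_page=100):
    """Number of search requests (pages) needed to fetch tweet_number tweets."""
    return math.ceil(tweet_number / tweets_per_page)

def windows_needed(tweet_number, tweets_per_page=100, requests_per_window=180):
    """Number of rate-limit windows those requests occupy."""
    return math.ceil(requests_needed(tweet_number, tweets_per_page) / requests_per_window)

print(requests_needed(200))    # 2 requests for the default of 200 tweets
print(requests_needed(5000))   # 50 requests
print(windows_needed(5000))    # 1 window
```

**Under these assumptions a single window covers up to 180 x 100 = 18,000 tweets.**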

In [22]:

# search function to get twitter data and save in csv, data-searchterms, file_name-csv file name to save, tweet_number-max number of tweets to get
def search_twitter(data, file_name, tweet_number=200):
    i = 0
    for tweet in tweepy.Cursor(api.search, q=data, count=100, lang='en').items():
        hashtags=[]
        quoted_hashtags=[]
        #check if hashtags exist
        if tweet.entities.get("hashtags")!=[]:
            hashtags = '['+','.join(pd.json_normalize(tweet.entities.get("hashtags"))['text'].tolist())+']'
        #check if it is quoted tweet and if hashtags exist in the quoted tweet
        if hasattr(tweet, 'quoted_status') and tweet.quoted_status.entities.get("hashtags")!=[]:
            quoted_hashtags = '['+','.join(pd.json_normalize(tweet.quoted_status.entities.get("hashtags"))['text'].tolist())+']'
        df.loc[i, 'text'] = tweet.text
        df.loc[i, 'user.name'] = tweet.user.name
        df.loc[i, 'user.statuses_count'] = tweet.user.statuses_count
        df.loc[i, 'user.screen_name'] = tweet.user.screen_name
        df.loc[i, 'user.followers_count'] = tweet.user.followers_count
        df.loc[i, 'user.location'] = tweet.user.location
        df.loc[i, 'user.verified'] = tweet.user.verified
        df.loc[i, 'user.profile_image_url_https'] = tweet.user.profile_image_url_https
        df.loc[i, 'favorite_count'] = tweet.favorite_count
        df.loc[i, 'retweet_count'] = tweet.retweet_count
        df.loc[i, 'created_at'] = tweet.created_at
        df.loc[i, 'hashtags']=hashtags
        df.loc[i, 'quoted_hashtags']=quoted_hashtags
        # process user location: geocode only if the user location is non-empty, then store lat/lng if a match is found
        if tweet.user.location:
            coord = geolocator.geocode(tweet.user.location)
            if coord is not None:
                df.loc[i, 'user_lat'] = coord.latitude
                df.loc[i, 'user_lng'] = coord.longitude
        i+=1
        # stop the search once i reaches tweet_number - max number of tweets to get
        if i == tweet_number:
            break
    # save data frame to csv with the file_name, once, after the search has finished
    df.to_csv('{}.csv'.format(file_name))
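
**In the search function, `tweet.entities` is a dictionary whose `hashtags` entry is a list of dictionaries, each carrying the hashtag under the key `text`. As a sketch, the same `[tag1,tag2]` string can be built with a plain list comprehension instead of `pd.json_normalize`. The helper name `format_hashtags` is hypothetical.**

```python
def format_hashtags(entities):
    """Return the hashtags as '[tag1,tag2]', or '[]' when there are none."""
    tags = [tag['text'] for tag in (entities or {}).get('hashtags', [])]
    return '[' + ','.join(tags) + ']'

# a hand-made entities dictionary in the shape described above
example = {'hashtags': [{'text': 'Jewish'}, {'text': 'cemeteries'}, {'text': 'epigraphic'}]}
print(format_hashtags(example))           # [Jewish,cemeteries,epigraphic]
print(format_hashtags({'hashtags': []}))  # []
```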

**Set search terms for tweets and CSV file name to save the returned tweets.**

In [23]:
# maximum tweets to get
#tweet_number=200
# set twitter query for getting data, file_name to save data; the query is one string and the OR operator must be upper case
search_twitter(data = 'digitalhumanities OR #digitalhumanities', file_name = 'dh_tweets')
# try different search terms and tweet number
#search_twitter(data = 'covid OR #covid', file_name = 'cv_tweets', tweet_number=500)

**Check and inspect the DataFrame.**

In [24]:
# check dataframe
df.head()

Unnamed: 0,text,user.name,user.statuses_count,user.screen_name,user.followers_count,user.location,user.verified,user.profile_image_url_https,favorite_count,retweet_count,created_at,hashtags,quoted_hashtags,user_lat,user_lng
0,RT @_epidat: (2/3) 38k inscriptions from ~220 ...,Bot do Laboratório de Humanidades Digitais da ...,1219,BotLabhd,59,LABHDUFBA,False,https://pbs.twimg.com/profile_images/129869437...,0,1,2021-01-31 22:56:37,"[Jewish,cemeteries,epigraphic]",[],,
1,(2/3) 38k inscriptions from ~220 #Jewish #ceme...,epidat,3248,_epidat,332,,False,https://pbs.twimg.com/profile_images/916020636...,1,1,2021-01-31 22:56:31,"[Jewish,cemeteries,epigraphic]",[],,
2,RT @helenkbones: If you've heard me going on a...,Bot do Laboratório de Humanidades Digitais da ...,1219,BotLabhd,59,LABHDUFBA,False,https://pbs.twimg.com/profile_images/129869437...,0,11,2021-01-31 22:18:02,[],[],,
3,RT @juliannenyhan: I'm very excited for @ianmi...,Andryah Ayotte,10,AndryahAyotte,0,,False,https://pbs.twimg.com/profile_images/135493179...,0,7,2021-01-29 05:48:34,[digitalhumanities],[],,
4,RT @juliannenyhan: I'm very excited for @ianmi...,eResearch UCL,12028,eResearch_UCL,2992,"London, England",False,https://pbs.twimg.com/profile_images/731053704...,0,7,2021-01-27 17:59:33,[digitalhumanities],[],51.5073,-0.127647
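
**The `hashtags` and `quoted_hashtags` columns are stored as bracketed strings such as `[Jewish,cemeteries,epigraphic]`. When the CSV file is read back, a small helper can turn such a cell into a Python list again. This is only a sketch; the name `parse_hashtags` is hypothetical.**

```python
def parse_hashtags(cell):
    """Turn '[tag1,tag2]' into ['tag1', 'tag2']; '[]' or a non-string cell gives []."""
    if not isinstance(cell, str):
        return []
    inner = cell.strip().strip('[]')
    return [tag for tag in inner.split(',') if tag]

print(parse_hashtags('[Jewish,cemeteries,epigraphic]'))  # ['Jewish', 'cemeteries', 'epigraphic']
print(parse_hashtags('[]'))                              # []
```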


**Example to export as multiple csv files for a large dataset.**

In [25]:
# set maximum tweets to get
tweet_number=5000
# search function to get twitter data and save in csv, tweet_number-maximum tweets to get, tweetsperfile-number of tweets per file
def search_twitter_split(data, file_name, tweet_number=5000, tweetsperfile=1000):
    tweet_count = 0 # total tweets count
    filetweet_count = 0 # tweets in file count
    fileno = 0 # file number count
    # create DataFrame from twitter data
    df = pd.DataFrame(columns = ['text', 'user.name', 'user.statuses_count','user.screen_name',
                             'user.followers_count', 'user.location', 'user.verified','user.profile_image_url_https',
                             'favorite_count', 'retweet_count', 'created_at','hashtags','quoted_hashtags','user_lat','user_lng'])

    for tweet in tweepy.Cursor(api.search, q=data, count=100, lang='en').items():
        hashtags="[]"
        quoted_hashtags="[]"
        #check if hashtags exist
        if tweet.entities.get("hashtags")!=[]:
            hashtags = '['+','.join(pd.json_normalize(tweet.entities.get("hashtags"))['text'].tolist())+']'
        #check if it is quoted tweet and if hashtags exist in the quoted tweet
        if hasattr(tweet, 'quoted_status') and tweet.quoted_status.entities.get("hashtags")!=[]:
            quoted_hashtags = '['+','.join(pd.json_normalize(tweet.quoted_status.entities.get("hashtags"))['text'].tolist())+']'
        #print(hashtags)
        df.loc[filetweet_count, 'text'] = tweet.text
        df.loc[filetweet_count, 'user.name'] = tweet.user.name
        df.loc[filetweet_count, 'user.statuses_count'] = tweet.user.statuses_count
        df.loc[filetweet_count, 'user.screen_name'] = tweet.user.screen_name
        df.loc[filetweet_count, 'user.followers_count'] = tweet.user.followers_count
        df.loc[filetweet_count, 'user.location'] = tweet.user.location
        df.loc[filetweet_count, 'user.verified'] = tweet.user.verified
        df.loc[filetweet_count, 'user.profile_image_url_https'] = tweet.user.profile_image_url_https
        df.loc[filetweet_count, 'favorite_count'] = tweet.favorite_count
        df.loc[filetweet_count, 'retweet_count'] = tweet.retweet_count
        df.loc[filetweet_count, 'created_at'] = tweet.created_at
        df.loc[filetweet_count, 'hashtags']=hashtags
        df.loc[filetweet_count, 'quoted_hashtags']=quoted_hashtags
        # process user location: geocode only if the user location is non-empty, then store lat/lng if a match is found; remove the triple quotes to enable this block (it also needs `import time`),
        # please note this may take a long time depending on the number of tweets, see nominatim usage policy: https://operations.osmfoundation.org/policies/nominatim/, e.g. no heavy uses (an absolute maximum of 1 request per second).
        '''if tweet.user.location:
            coord = geolocator.geocode(tweet.user.location)
            time.sleep(1)
            if coord is not None:
                df.loc[filetweet_count, 'user_lat'] = coord.latitude
                df.loc[filetweet_count, 'user_lng'] = coord.longitude'''
        tweet_count+=1 # add 1 to total tweets count
        filetweet_count+=1 # add 1 to tweets in file count
        # when the dataframe holds tweetsperfile tweets, save it to a csv file and start a new, empty dataframe with the same columns
        if filetweet_count >= tweetsperfile:
            df.to_csv(file_name+"_"+str(fileno)+".csv",index = False)
            df = pd.DataFrame(columns = df.columns)
            fileno+=1 # add 1 to file number count
            filetweet_count = 0 # reset tweets in file count
        # stop searching once the total tweets retrieved reaches tweet_number
        if tweet_count >= tweet_number:
            break
    # save any remaining tweets, whether the search stopped at tweet_number or ran out of results
    if filetweet_count != 0:
        df.to_csv(file_name+"_"+str(fileno)+".csv",index = False)
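
**The splitting logic is independent of Twitter: take rows from a source, stop after a maximum number, and start a new numbered file every `rows_per_file` rows. The sketch below shows that pattern with the standard `csv` module on arbitrary rows. The function `write_in_chunks` and its parameters are hypothetical and are not part of the notebook's own code.**

```python
import csv
import itertools

def write_in_chunks(rows, header, file_name, rows_per_file=1000, max_rows=5000):
    """Write at most max_rows rows into numbered csv files holding up to
    rows_per_file rows each; return the list of file names written."""
    written = []
    limited = itertools.islice(rows, max_rows)  # stop after max_rows rows
    for fileno in itertools.count():
        chunk = list(itertools.islice(limited, rows_per_file))
        if not chunk:  # nothing left to write
            break
        name = '{}_{}.csv'.format(file_name, fileno)
        with open(name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk)
        written.append(name)
    return written
```

**Because the final, partly filled chunk is written like any other, no rows are lost when the source runs out before `max_rows` is reached.**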

**Set search terms for tweets and CSV file name to save the returned tweets. File number will be automatically added after file_name.**

In [26]:
# set twitter query for getting data, file_name to save data; the query is one string with terms joined by the upper-case OR operator
search_twitter_split(data = 'datascience OR #datascience OR bigdata OR #bigdata OR ai OR #ai', file_name = 'dba_tweets',tweet_number=tweet_number)

1.1.2 Use Twitter Streaming API to get data
------

In [5]:
# import tweepy and other required python libraries
import tweepy
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy import API
import time
import csv
import sys
import json

**Define a customised Stream Listener class. Note: maxnotweets is the maximum number of tweets to get and tweetsperfile is the maximum number of tweets in each file. The listener only keeps quoted tweets in which both the tweet and the quoted tweet contain hashtags; remove or change that condition if this restriction is not needed.**
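
**The filter condition can be read in isolation as a small predicate. The sketch below uses stand-in objects built with `SimpleNamespace` that mimic only the attributes the condition touches (`entities` and `quoted_status`); they are not real Tweepy status objects, and the function name is hypothetical.**

```python
from types import SimpleNamespace

def has_own_and_quoted_hashtags(status):
    """True only for a quoted tweet where both the tweet and the quoted tweet carry hashtags."""
    quoted = getattr(status, 'quoted_status', None)
    if quoted is None:
        return False
    return bool(status.entities.get('hashtags')) and bool(quoted.entities.get('hashtags'))

# stand-in objects for illustration only
quoted = SimpleNamespace(entities={'hashtags': [{'text': 'dh'}]})
kept = SimpleNamespace(entities={'hashtags': [{'text': 'covid'}]}, quoted_status=quoted)
plain = SimpleNamespace(entities={'hashtags': [{'text': 'covid'}]})

print(has_own_and_quoted_hashtags(kept))   # True
print(has_own_and_quoted_hashtags(plain))  # False, not a quoted tweet
```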

In [6]:
# Customised Stream Listener class. Note: the listener below only keeps quoted tweets in which both the tweet and the quoted tweet contain hashtags; remove or change that condition if this restriction is not needed
class CustomListener(StreamListener):
    
    # init function for the listener api - not used in this use case, maxnotweets - maximum number of tweets to get, tweetsperfile - maximum number of tweets in each file
    def __init__(self, api = None,maxnotweets=100,tweetsperfile=30):
        self.api = api
        self.counter = 0 # counter for no of tweet per file
        self.maxcounter = 0 # counter for max no of tweets to get
        self.max = maxnotweets # maximum number of total tweets to get
        self.perfile = tweetsperfile # maximum number of tweets in each file
        # create a file with 'tweets_' and the current time
        self.filename = 'tweets'+'_'+time.strftime('%Y%m%d-%H%M%S')+'.csv' # first file name to save
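        # create a new file with the write handle and the file name; newline='' avoids blank rows on Windows, utf-8 keeps emoji and non-Latin text
        csvFile = open(self.filename, 'w', newline='', encoding='utf-8')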
        
        # create a csv writer
        csvWriter = csv.writer(csvFile)
        
        # use writeheader function to write the headers of the columns
        self.writeHeader(csvWriter)
        
        # close the csv file
        csvFile.close()
    
    #Define a function to write header
    def writeHeader(self,csvWriter):
        # Write a single row with the headers of the columns
        csvWriter.writerow(['text',
                            'created_at',
                            'geo',
                            'lang',
                            'place',
                            'hashtags',
                            'coordinates',
                            'user.description',
                            'user.location',
                            'user.id',
                            'user.created_at',
                            'user.url',
                            'user.followers_count',
                            'user.default_profile_image',
                            'user.utc_offset',
                            'user.name',
                            'user.lang',
                            'user.screen_name',
                            'user.geo_enabled',
                            'user.time_zone',
                            'id',
                            'favorite_count',
                            'retweeted',
                            'source',
                            'favorited',
                            'retweet_count',
                            'quoted_status',
                            'quoted_text',
                            'quoted_hashtags'])
    
    
    # Called when a new status/tweet arrives
    def on_status(self, status):
        # check if the counter of maximum total number of tweets is reached, return False to on_data function and break the streaming. 
        if self.maxcounter >= self.max:
            return False
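        # open the current csv file in append mode; newline='' avoids blank rows on Windows, utf-8 keeps emoji and non-Latin text
        csvFile = open(self.filename, 'a', newline='', encoding='utf-8')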
        
        # check if the counter of maximum number of tweets per file is reached, output data to csv and create a new csv file
        if self.counter >= self.perfile:
            # close the csv file
            csvFile.close()
            # create a new file name with 'tweets_' and the current time
            self.filename = 'tweets'+'_'+time.strftime('%Y%m%d-%H%M%S')+'.csv'
            # create a new file with write handle and the filename
            csvFile = open(self.filename, 'w', newline='', encoding='utf-8')
            
            # create a csv writer
            csvWriter = csv.writer(csvFile)
        
             # use writeheader function to write the headers of the columns
            self.writeHeader(csvWriter)
            
            # reset counter for maximum number of tweets per file
            self.counter = 0
            
        
        
        # create a csv writer
        csvWriter = csv.writer(csvFile)
        try:
            #check if it is quoted tweet and if hashtags exist in the quoted tweet and original tweet, remove or change this condition if quoted tweets with both hashtags and quoted tweet hashtags is not important
            if hasattr(status, 'quoted_status') and status.entities.get("hashtags")!=[] and status.quoted_status.entities.get("hashtags")!=[]:
                # transfer and normalize hashtags format 
                if status.entities.get("hashtags")!=[]:
                    hashtags = '['+','.join(pd.json_normalize(status.entities.get("hashtags"))['text'].tolist())+']'
                #check if it is quoted tweet and if hashtags exist in the quoted tweet
                if hasattr(status, 'quoted_status') and status.quoted_status.entities.get("hashtags")!=[]:
                    quoted_hashtags = '['+','.join(pd.json_normalize(status.quoted_status.entities.get("hashtags"))['text'].tolist())+']'
                # write the tweet's information to the csv file
                csvWriter.writerow([status.text,
                                        status.created_at,
                                        status.geo,
                                        status.lang,
                                        status.place,
                                        hashtags,
                                        status.coordinates,
                                        status.user.description,
                                        status.user.location,
                                        status.user.id,
                                        status.user.created_at,
                                        status.user.url,
                                        status.user.followers_count,
                                        status.user.default_profile_image,
                                        status.user.utc_offset,
                                        status.user.name,
                                        status.user.lang,
                                        status.user.screen_name,
                                        status.user.geo_enabled,
                                        status.user.time_zone,
                                        status.id,
                                        status.favorite_count,
                                        status.retweeted,
                                        status.source,
                                        status.favorited,
                                        status.retweet_count,
                                        status.quoted_status,
                                        status.quoted_status.text,
                                        quoted_hashtags
                                   ])
                # add 1 to counters
                self.counter += 1
                self.maxcounter += 1

        # If error happens
        except Exception as e:
            # Print the error
            print(e)
            # and continue
            pass    
        # Close the csv file
        csvFile.close()

        return True

    # called when a non-200/success status code is returned
    def on_error(self, status_code):
        # print the status code
        print('Encountered error with status code:', status_code)
        
        # when status code is 401, which is the error for bad credentials
        if status_code == 401:
            # end the stream
            return False

    # called when a delete notice arrives for a status/tweet
    def on_delete(self, status_id, user_id):
        
        # print delete message
        print("Deleted notice")
        
        return

    # called when a limitation notice arrives/reach the rate limit
    def on_limit(self, track):
        
        # print rate limiting message
        print("Rate limited...")
        
        # continue mining tweets
        return True

    # called when stream connection times out
    def on_timeout(self):
        
        # print timeout message to standard error
        print('Timeout...', file=sys.stderr)
        
        # wait 10 seconds
        time.sleep(10)
        
        return
    
    # called when twitter sends a disconnect notice; Tweepy 3.x passes the notice as an argument
    def on_disconnect(self, notice):
        
        # print disconnect message
        print('Twitter disconnected...')
       
        return False
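
**The listener prints the status code on errors and stops only on 401. For status code 420 (rate limited), Twitter's streaming guidance is to back off exponentially, starting with a 1 minute wait and doubling on each attempt. The sketch below only generates such a waiting schedule; the function name and the cap of 960 seconds are assumptions made for this illustration.**

```python
def backoff_delays(attempts, first_delay=60, cap=960):
    """Yield wait times in seconds that double after each attempt, up to cap."""
    delay = first_delay
    for _ in range(attempts):
        yield delay
        delay = min(delay * 2, cap)

print(list(backoff_delays(6)))  # [60, 120, 240, 480, 960, 960]
```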

**Define a function for twitter streaming. Note: queries-defined search terms, maxnotweets - maximum number of tweets to get, tweetsperfile - maximum number of tweets in each file.**

In [7]:
# define a function for twitter streaming 
def stream_twitter(queries,maxnotweets=100,tweetsperfile=30):

    
    # set twitter API credentials information
    consumer_key = '#######'
    consumer_secret = '#######'
    access_token = '#######'
    access_token_secret = '#######'

    # create the authorization handler object
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    
    # set up the API with the authorization handler
    api = API(auth)
    # use the streaming listener
    listener = CustomListener(api,maxnotweets,tweetsperfile)

    
    # create a streaming object with the custom listener and authorization
    stream = Stream(auth, listener)

    # run the streaming using the user defined search terms
    stream.filter(track=queries)

**Start the Twitter streaming for the search terms. This may take a long time, depending on maxnotweets and on how frequently the search terms are mentioned on Twitter.**

In [8]:
# start the twitter streaming for search terms; each list item is a separate phrase and a tweet matching any one of them is returned, e.g. ['digitalhumanities', '#digitalhumanities'] or ['covid', '#covid']
stream_twitter(['covid', '#covid'])
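
**In the `track` list each item is a phrase. A tweet matches when all terms of at least one phrase are present, so separate items act as OR and spaces inside an item act as AND. The sketch below is a deliberately simplified model of that rule, written for illustration only: it splits on whitespace and ignores the punctuation and hashtag handling that Twitter applies. The name `matches_track` is hypothetical.**

```python
def matches_track(text, track):
    """Simplified model: True if every term of at least one phrase occurs as a word in text."""
    words = set(text.lower().split())
    for phrase in track:
        terms = phrase.lower().split()
        if terms and all(term in words for term in terms):
            return True
    return False

print(matches_track('New covid rules today', ['covid', '#covid']))  # True
# a lower-case "or" inside one item is treated as a required word
print(matches_track('Reading about digitalhumanities', ['digitalhumanities or #digitalhumanities']))  # False
```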

### Reference:
**How to get API Keys and Tokens for Twitter<br/>
https://www.slickremix.com/docs/how-to-get-api-keys-and-tokens-for-twitter/<br/>
Stream Tweets in Under 15 Lines of Code + Some Interactive Data Visualization<br/>
https://dzone.com/articles/stream-tweets-the-easy-way-in-under-15-lines-of-co<br/>
Twitter Firehose vs. Twitter API: What’s the difference and why should you care?<br/>
https://brightplanet.com/2013/06/25/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/ <br/>
Twitter Data Visualisation<br/>
https://nbviewer.jupyter.org/github/SantaDS/DataVisualisation/blob/master/TwitterDataAnalysis/twitter_data_analysis.ipynb<br/>
Mine Twitter's Stream For Hashtags Or Words<br/>
https://chrisalbon.com/python/other/mine_a_twitter_hashtags_and_words/<br/>
Twitter Data Visualisation<br/>
https://www.kaggle.com/tuncbileko/twitter-data-visualisation/<br/>
Tweepy API Reference<br/>
http://docs.tweepy.org/en/latest/api.html#API.search<br/>
Tweepy Streaming<br/>
https://github.com/tweepy/tweepy/blob/78d2883a922fa5232e8cdfab0c272c24b8ce37c4/tweepy/streaming.py<br/>
Twitter API<br/>
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user<br/>**