## Data Collection

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets

In this library we use the Twitter standard search API, which returns a collection of relevant Tweets matching a specified query.

The Search API is not meant to be an exhaustive source of Tweets. **Not all Tweets will be indexed or made available via the search interface**.

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/guides/working-with-timelines

The Twitter API has several methods, such as GET statuses/user_timeline and GET statuses/home_timeline, which return a timeline of Tweet data. Such timelines can grow very large, so there are limits to how much of a timeline a client application may fetch in a single request. Applications must therefore iterate through timeline results in order to build a more complete list.
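The iteration described above can be illustrated with a minimal sketch. It assumes a hypothetical `fetch_page(max_id)` callable standing in for a real API request; `iterate_timeline` and `fake_fetch` are illustrative names, not part of the Twitter API or Twython. After each page, `max_id` is moved to one less than the lowest ID received, so no tweet is requested twice.

```python
def iterate_timeline(fetch_page, max_pages=20):
    """Collect tweets page by page, moving max_id below the lowest ID seen.

    fetch_page is a hypothetical callable that takes max_id (None for the
    first page) and returns a list of tweet dicts with an 'id' key.
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask only for tweets strictly older than the oldest one received
        max_id = min(tweet['id'] for tweet in page) - 1
    return collected


# Stand-in for an API call: serves IDs 10..1, three per page, newest first
def fake_fetch(max_id, _ids=tuple(range(10, 0, -1))):
    eligible = [i for i in _ids if max_id is None or i <= max_id]
    return [{'id': i} for i in eligible[:3]]


tweets = iterate_timeline(fake_fetch)
print([t['id'] for t in tweets])  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Tracking the lowest ID, rather than a page offset, keeps the walk stable even when new tweets arrive at the top of the timeline between requests.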

In [24]:
# Data Collection Functions

import os

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from twython import Twython


# Credentials are read from environment variables instead of being
# hard-coded, so that secrets are not stored in the notebook.
# The variable names are an assumed convention; set them before running.
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET_KEY = os.environ["TWITTER_API_SECRET_KEY"]


def collect_tweets(query='', geocode=None, result_type='recent',
                   num_of_page=20, count=100, since=None, until=None):
    '''Collects tweets using the Twitter standard search API and
    returns a list of dictionaries, each representing a tweet.

    query: search query.
    geocode: Returns tweets by users located within a given radius
             of the given latitude/longitude. The value is specified
             as "latitude,longitude,radius", with the unit included in
             the radius, e.g. "43.653226,-79.383184,100km".
    result_type: Specifies what type of search results to receive.
                  mixed : include both popular and real-time results.
                  recent : return only the most recent results.
                  popular : return only the most popular results.
    num_of_page: maximum number of follow-up pages to request after
                 the first one.
    count: The number of tweets to return per page, up to a maximum
           of 100. Defaults to 100 here (the API's own default is 15).
    since: Returns tweets created after the given date.
           Date should be formatted as YYYY-MM-DD.
           The search index has a 7-day limit.
    until: Returns tweets created before the given date.
           Date should be formatted as YYYY-MM-DD.
           The search index has a 7-day limit.

    The following search parameters are not arguments of this function:
    max_id: set internally on each follow-up request, so that only
            results with an ID less than (that is, older than) the
            tweets already collected are returned.
    include_entities: always sent as 'true', so the entities node is
                      included in every tweet.
    since_id: not used.
    '''

    # Authentication
    twitter_obj = Twython(API_KEY, API_SECRET_KEY,
                          ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    # Use Twitter standard API search
    tweet_result = twitter_obj.search(q=query, geocode=geocode,
                                      result_type=result_type, count=count,
                                      since=since, until=until,
                                      include_entities='true',
                                      tweet_mode='extended', lang='en')

    # To avoid redundant tweets, as explained in
    # https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines
    # the timeline is read relative to the IDs of tweets already processed
    # instead of relative to the top of the list (which changes frequently).
    tweets_list = tweet_result['statuses']
    i = 0  # number of follow-up pages requested so far
    rate_limit = 1  # remaining calls in the current rate-limit window
    while tweet_result['statuses'] and i < num_of_page:
        if rate_limit < 1:
            # A wait until the rate-limit window resets would need to be
            # added here in order to collect data beyond the available limit
            print(str(rate_limit) + ' Rate limit!')
            break
        # One less than the ID of the last (oldest) tweet on the latest page
        max_id = tweet_result['statuses'][-1]['id'] - 1

        # Keep the latest page in tweet_result so that the next max_id
        # is derived from it and each request returns older tweets
        tweet_result = twitter_obj.search(q=query, geocode=geocode,
                                          result_type=result_type,
                                          count=count, since=since,
                                          until=until,
                                          include_entities='true',
                                          tweet_mode='extended',
                                          lang='en',
                                          max_id=str(max_id))

        tweets_list += tweet_result['statuses']
        i += 1
        # Refresh the remaining-call budget from the last response header
        rate_limit = int(twitter_obj.get_lastfunction_header(
            'x-rate-limit-remaining'))

    return tweets_list



def make_dataframe(tweets_list, search_term):
    '''Takes the list of tweets and returns it as a pandas DataFrame.

    The search_term column holds search_term for tweets whose text
    contains it (case-insensitive) and None otherwise.
    '''

    term = search_term.lower()
    df = pd.DataFrame({
        'tweet_id': [tweet['id'] for tweet in tweets_list],
        'user': [tweet['user']['screen_name'] for tweet in tweets_list],
        'time': [tweet['created_at'] for tweet in tweets_list],
        'tweet_text': [tweet['full_text'] for tweet in tweets_list],
        'location': [tweet['user']['location'] for tweet in tweets_list],
        'hashtags': [tweet['entities']['hashtags'] for tweet in tweets_list],
        'search_term': [search_term if term in tweet['full_text'].lower()
                        else None for tweet in tweets_list],
    })

    return df

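The comment in `collect_tweets` notes that a rate-limit timeout still needs to be added. One possible approach, sketched below under stated assumptions, is to sleep until the window resets. The sketch assumes the reset time is available from the `x-rate-limit-reset` response header as UTC epoch seconds; `seconds_until_reset` and its `margin` parameter are hypothetical helpers, not part of Twython.

```python
import time


def seconds_until_reset(reset_epoch, now=None, margin=1.0):
    """Return how long to wait until the rate-limit window resets.

    reset_epoch: value of the x-rate-limit-reset header (epoch seconds,
                 as a string or number).
    now: current time in epoch seconds; defaults to time.time().
    margin: extra seconds added as a safety buffer (an assumption).
    """
    if now is None:
        now = time.time()
    return max(0.0, float(reset_epoch) - now + margin)


# Hypothetical use inside the paging loop, replacing the `break`:
#     reset = twitter_obj.get_lastfunction_header('x-rate-limit-reset')
#     time.sleep(seconds_until_reset(reset))

print(seconds_until_reset('1605577800', now=1605577500.0))  # 301.0
```

Clamping at zero avoids passing a negative value to `time.sleep` when the window has already reset.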

In [25]:
# Test

query = 'Walmart'
# geocode = "43.653226,-79.383184,100km"  # Toronto
# geocode = "49.525238,-93.874023,4000km"  # North America
tweets_list = collect_tweets(query=query, geocode="49.525238,-93.874023,4000km")
df = make_dataframe(tweets_list, query)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2095 entries, 0 to 2094
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   tweet_id     2095 non-null   int64 
 1   user         2095 non-null   object
 2   time         2095 non-null   object
 3   tweet_text   2095 non-null   object
 4   location     2095 non-null   object
 5   hashtags     2095 non-null   object
 6   search_term  1956 non-null   object
dtypes: int64(1), object(6)
memory usage: 114.7+ KB
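The summary above shows the `time` column stored as `object`, that is, as raw strings. A minimal sketch of parsing that string format into timezone-aware datetimes follows; `TWITTER_TIME_FORMAT` and `parse_created_at` are illustrative names, and the format string is inferred from the values shown in this notebook. Parsing of the day and month names assumes an English locale.

```python
from datetime import datetime

# Format inferred from values such as 'Tue Nov 17 01:48:01 +0000 2020'
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


parsed = parse_created_at('Tue Nov 17 01:48:01 +0000 2020')
print(parsed.isoformat())  # 2020-11-17T01:48:01+00:00
```

On the DataFrame, the same parsing could be applied with `df['time'].map(parse_created_at)`; passing the format string to `pd.to_datetime` should work as well.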


In [27]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
df.head(400)

| | tweet_id | user | time | tweet_text | location | hashtags | search_term |
|---|---|---|---|---|---|---|---|
| 0 | 1328515200925003778 | lglovinsky11 | Tue Nov 17 01:48:01 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | | [] | Walmart |
| 1 | 1328514902382817282 | FRWessling | Tue Nov 17 01:46:49 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | Always Here, Sometimes Now | [] | Walmart |
| 2 | 1328514480712650752 | PonytailPixie | Tue Nov 17 01:45:09 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | Sparkle Mountain | [] | Walmart |
| 3 | 1328513128754270209 | FanWalmart | Tue Nov 17 01:39:47 +0000 2020 | RT @WalmartCAGaming: Happy Monday! A quick upd... | Walmart Canada :) | [{'text': 'PS5', 'indices': [135, 139]}] | Walmart |
| 4 | 1328512926173589505 | avmoca | Tue Nov 17 01:38:58 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | | [] | Walmart |
| 5 | 1328512860721459201 | MarthaLynneOwe1 | Tue Nov 17 01:38:43 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | Southern Maine | [] | Walmart |
| 6 | 1328512705557311488 | tightenupjit | Tue Nov 17 01:38:06 +0000 2020 | Bout to hit this Walmart truck for this ps5 | Florida | [] | Walmart |
| 7 | 1328512627589468161 | jjj5819 | Tue Nov 17 01:37:47 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | Worcester MA | [] | Walmart |
| 8 | 1328512204237398020 | ymbertmarypaz | Tue Nov 17 01:36:06 +0000 2020 | @AcidComment__ Walmart o Pricesmart! | Guatemala | [] | Walmart |
| 9 | 1328511952457506821 | BBF8droid | Tue Nov 17 01:35:06 +0000 2020 | RT @QasimRashid: COVID19 has closed 20% of sma... | Undrainable Swamp | [] | Walmart |
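As the output shows, each `hashtags` value is a list of entity dictionaries with `text` and `indices` keys rather than a list of plain strings. A small sketch of flattening that structure is given below; `hashtag_texts` is a hypothetical helper name, and the sample value is copied from the row for `FanWalmart` above.

```python
def hashtag_texts(hashtags):
    """Return only the hashtag strings from a list of entity dicts."""
    return [tag['text'] for tag in hashtags]


sample = [{'text': 'PS5', 'indices': [135, 139]}]
print(hashtag_texts(sample))  # ['PS5']
print(hashtag_texts([]))      # []
```

Applied to the DataFrame, this could be written as `df['hashtags'].map(hashtag_texts)`, which would leave tweets without hashtags as empty lists.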
