# 01. Data Gathering 

In this notebook, we'll show you how we've gathered both historical with GetOldTweets3, and current tweets with Tweepy.

- [Install and Import Packages](#Import-and-Import-Packages)
- [Historical Twitter Data](#Historical-Twitter-Data)
- [Current Tweets](#Current-Tweets)
- [Sources](#Sources)

### Install and Import Packages

In [1]:
#you'll need to install these packages if they aren't already installed. 
#!pip install GetOldTweets3
#!pip install Tweepy

In [58]:
#Imports 
import numpy as np
import pandas as pd
import GetOldTweets3 as got
import time
import tweepy

### Historical Twitter Data

We used the GetOldTweets3 library to pull historical twitter data. We build a function to use GetOldTweets3 to pull a list of search words related to the disaster. Hurricane Michael began to develop on October 1st, 2018, so that's when we began our search and scrapped through October 15th.

In [59]:
#first we began by creating a dictionary to store tweet infromation in. adapted from this project: 
# https://github.com/giffordtompkinsiii/Evacuation_Twitter_Client_Project/blob/master/code/01_Historic_and_Live_Twitter_Scraper.ipynb
def create_tweet(tweet):    
    tweet_dict = {}
    tweet_dict['id'] = tweet.id
    tweet_dict['username'] = tweet.username
    tweet_dict['date'] = tweet.date
    tweet_dict['text'] = tweet.text
    tweet_dict['hashtags'] = tweet.hashtags
    tweet_dict['geo'] = tweet.geo
    return tweet_dict

In [66]:
#function to pull tweets using GetOldTweets3 
def get_tweets(search_criteria,start_date, end_date, tweets_per_search):
    tweet_list = []
    for word in search_criteria: # queries for word in list of words to search
            tweetCriteria = got.manager.TweetCriteria().setQuerySearch(word)\
                                                       .setSince(start_date)\
                                                       .setUntil(end_date)\
                                                       .setMaxTweets(tweets_per_search)\

            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
            time.sleep(2)

            for tweet in tweets:
                tweet_list.append(create_tweet(tweet))
    return pd.DataFrame(tweet_list)

Next, compile a list of words to use as your search parameters. Since many tweets don't have any attached location data, the more specific these terms are, the better. We recommend having a pre-built list with popular hashtags used to identify the area affected and any terms that might be used to identify the event.

In [67]:
#list of words to include in our GetOldTweets3 query. 
search_words = ['hurricanemichael','hurricane', 'panamacitybeach', 'panamacityflorida', '850strong', 'gulfpower',
                'baycounty', 'floridastrong', 'panhandlestrong']


From there we used the function to pull tweets, then check for duplicates.

In [68]:
#calling the function to pull tweets, Oct. 1 - Oct. 15st 2018, returning 500 tweets per searchterm 
test = get_tweets(search_words, '2018-10-01' , '2018-10-15', 500)

In [110]:
#checking out the shape 
test.shape

(3790, 6)

In [111]:
#dropping any duplicates. 
test.drop_duplicates(subset = 'id', inplace = True)

In [112]:
#re-checking shape, we've droppped 132 duplicate tweets 
test.shape

(3658, 6)

In [113]:
#checking out the DF 
test

Unnamed: 0,id,username,date,text,hashtags,geo
0,1051623671951974400,GulfPower,2018-10-14 23:59:56+00:00,“We are pleased to be making steady progress a...,#HurricaneMichael,
1,1051623669363965952,Postcards4Potus,2018-10-14 23:59:55+00:00,@realDonaldTrump really doesn't care! Seriousl...,#HurricaneMichael,
2,1051623651320184832,LauraHKByrne,2018-10-14 23:59:51+00:00,This is an excellent point. The best and easie...,#HurricaneMichael,
3,1051623649197875201,SupplierCom,2018-10-14 23:59:50+00:00,For those affected by #HurricaneMichael member...,#HurricaneMichael,
4,1051623615911927808,Heart_to_Heart,2018-10-14 23:59:42+00:00,The devastation from #HurricaneMichael is hard...,#HurricaneMichael #PanamaCity #Florida,
...,...,...,...,...,...,...
3785,1050007027676835842,valerie__lyn,2018-10-10 12:55:58+00:00,Praying for my home and everyone in the panhan...,#FloridaStrong #PanhandleStrong #HurricaneMichael,
3786,1050005160351666176,justinkieferwx,2018-10-10 12:48:33+00:00,#Michael | live streaming coverage on http://f...,#Michael #panhandlestrong,
3787,1049993102814011392,Juanyehthomas1,2018-10-10 12:00:38+00:00,I pray my people back home in the panhandle st...,#PanhandleStrong,
3788,1049849561542479872,legacyonthebay,2018-10-10 02:30:15+00:00,Our area is prepared and ready for action! #pa...,#panhandlestrong,


Next, we'll add an aditional column to this dataframe, since these tweets are pulled via search word, we'll assign a 0 to the traffic column to idicate the tweets did not come specifically from a traffic twitter account. Then we'll write the data to a csv for data cleaning and modeling. 

In [114]:
#creating new column, assigning value 0 
test['traffic'] = 0 

In [72]:
#writting data to CSV, commented out as to not re-write my data. 
#test.to_csv('../datasets/searched_tweets.csv', index = False)

Next, we'll gather data from the twitter users who report on traffic. The function below has alot of similarities with the previous one, but this one is searching based off usernames instead of key words. 

In [115]:
def get_tweets_accounts(search_username ,start_date, end_date, tweets_per_search, is_loc = True):
    tweet_list = []
    if is_loc:
        for word in search_username: 
            tweetCriteria = got.manager.TweetCriteria().setUsername(search_username)\
                                                       .setSince(start_date)\
                                                       .setUntil(end_date)\
                                                       .setMaxTweets(tweets_per_search)\

            tweets = got.manager.TweetManager.getTweets(tweetCriteria)
            time.sleep(2)

            for tweet in tweets:
                tweet_list.append(create_tweet(tweet))
        return pd.DataFrame(tweet_list)

Next, create a list of twitter usernames that you'd like to search. The usernames that we used to pull data are: @fl511_panhand1, 511 for the Florida panhandle; @BayCountyTMC,  Bay County Traffic Management Center; and @WJHG_TV, a Northwest Florida news station. We'll run through the same processing steps as the previous data set. 

In [116]:
#list of usernames to search
usernames = ['BayCountyTMC' , "fl511_panhandl" , 'WJHG_TV']

In [117]:
#pulling the tweets 
traffic_tweets =get_tweets_accounts(usernames, '2018-10-01' , '2018-10-15', 1000)

In [118]:
#checking out the shape 
traffic_tweets.shape

(2076, 6)

In [119]:
#dropping duplicate tweets 
traffic_tweets.drop_duplicates(subset = 'id' , inplace = True)

In [120]:
#calling shape again, there were a lot of duplicate tweets in this pull. 
traffic_tweets.shape

(692, 6)

In [121]:
#creating a categorical system to identify which search this tweet came from. 
traffic_tweets['traffic'] = 1 

In [122]:
#writting to CSV for working data. 
#traffic_tweets.to_csv('../datasets/account_tweets.csv', index = False)

## Current Tweets

 Twitter's free API, and the python library Tweepy, allow us to pull data from the specific keywords and from specific usernames for the past 7 days. We'll feed our same search parameters from our historic tweets search for the current tweets as well. 

The first step for pulling tweets via Tweepy is to access our API keys, then we'll authenticate our connection via Tweepy. To do this on your own, you'll need to apply for your Twitter developer keys. 

In [19]:
#reading in our hidden keys 
ENV = pd.read_json("../env.json", typ = 'series')

In [20]:
#assigning keys and tokens to variables, here is where you'd enter your own Twitter API keys 
API_KEY= ENV["API key"]
API_SECRET_KEY = ENV["API secret key"]
ACCESS_TOKEN = ENV["Access token"]
ACCESS_TOKEN_SECRET = ENV['Access token secret']

In [21]:
#authenticating via tweepy our API access. 
auth = tweepy.OAuthHandler(API_KEY , API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN , ACCESS_TOKEN_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True)

First, we created a function to pull the tweets from the specific usernames, using the Tweepy wrapper. 

In [24]:
#Created a function that pull the most recent 200 tweets from the twitter user who's handle you feed into the function.  
# This was adapted from a previous group : https://github.com/DCapella/evacuation-routes/blob/master/code/The_Function_H.Toumy_X.Sin'Austin.ipynb

def gather_tweets(handle:str, n=200): # takes twitter username , and number of tweets to return, 200 is the max. 
    tweets_everything = api.user_timeline(handle, count = n)     #using the tweepy API to pull the data
    df = pd.DataFrame(columns = ['id', 'tweets', 'date', 'location'])  #empty DF the tweets will eventually go into 
    for i in tweets_everything: # for each tweet 
        tweets = i.text         #text of tweet 
        try: 
            date = i.formatted_date     #trying to find date by calling formated_date, but if that doesn't work 
        except: 
            date = i.created_at     #pull when the tweet was created_at 
        try:
            location = i.geo['coordinates'] #next, we'll check to make sure there aren't any attached geo data
        except: 
            try: 
                location = i.coordinates  # if theres no stored geo data, does the tweet itself have coordinates.
            except: 
                location = 'NaN'   #finally, fill with NAN if we can recover no location data. 
                
        tweet_id = i.id     # tweet id assigned 
        df.loc[len(df)] = [tweet_id, tweets, date, location]  #creating DF 
        
    return df

We then used that function to pull the most recent tweets from our 3 previously searched traffic and news accounts: @BayCountyTMC , @fl511_panhandl, and @WJHG_TV. We then concated the data to a single DataFrame for export. 

In [25]:
#running the function on our first Twitter user then creating a username column in the DF, checking out the first few rows
traffic1 = gather_tweets('BayCountyTMC')
traffic1['username'] = 'BayCountyTMC'
traffic1.head(3)

In [27]:
#running function to pull our second twitter user tweets, adding a username column. 
traffic2 = gather_tweets('fl511_panhandl')
traffic2['username']= 'fl511_panhandl'
traffic2.head(3)

In [None]:
#pulling final twitter users tweets from the past 7 days. 
traffic3 = gather_tweets('WJHG_TV')
traffic3['username'] = 'WJHG_TV'
traffic3.head(3)

In [32]:
#concating that DF to be exported and used as testing data. 
live_traffic = pd.concat(objs = [traffic1, traffic2, traffic3],
                       axis= 0)

live_traffic.shape

(600, 5)

In [None]:
#writing data to CSV, commented out to ensure we don't rewrite. 
#live_traffic.to_csv('../datasets/live_traffic.csv', index = False)

From there created the function to pull key search words.

In [34]:
#Created function that pulls search data via Tweepy. again this was adapted from the same code above. 
def gather_tweets(q , n=200):   #search query, number of tweets to return
    tweets_everything = api.search(q , count = n)   #using Tweepy API for the search 
    df = pd.DataFrame(columns = ['id', 'tweets', 'date', 'location']) #creating the df for the tweets. 
    
    for i in tweets_everything:  #for each tweet 
        tweets = i.text      #text of tweet 
        try: 
            date = i.formatted_date   #trying to find the date by calling, fornatted_date, but if doesn't work
        except: 
            date = i.created_at      #pull when the tweet was created_at 
        try:
            location = i.geo['coordinates']  #next, we'll check to make sure there aren't any attached geo data
        except: 
            try: 
                location = i.coordinates  #if there is no stored geo data, is there coordinates attatched to the tweet
            except: 
                location = 'NaN'  #fill values where no coordinate info was found as NAN
                
        tweet_id = i.id           #tweet ID 
                
        df.loc[len(df)] = [tweet_id, tweets, date, location] # looping through tweets and adding them to DF
        
    return df

In [62]:
#as a reminder, here are our search words 
search_words 

['hurricanemichael',
 'hurricane',
 'panamacitybeach',
 'panamacityflorida',
 '850strong',
 'gulfpower',
 'baycounty',
 'floridastrong',
 'panhandlestrong']

From there we use the function to pull the recent twitter data for our search words and concat them to a single DataFrame for export. 

In [71]:
#here we're running the list of words through the function 
hurricane_michael = gather_tweets('hurricanemichael')
hurricane = gather_tweets('hurricane')
panama_city_beach = gather_tweets('panamacitybeach')
panama_city_florida = gather_tweets('panamacityflorida')
strong = gather_tweets('850strong')
gulfpower = gather_tweets('gulfpower')
baycounty = gather_tweets('baycounty')
florida_strong = gather_tweets('floridastrong')
panhandle_strong = gather_tweets('panhandlestrong')

In [72]:
#Concating our search words DF into a single DF 
live_search = pd.concat( objs = [hurricane_michael, hurricane, panama_city_beach,
                                 panama_city_florida, strong, gulfpower,
                                 baycounty, florida_strong, panhandle_strong], axis = 0 )

In [73]:
#finding the shape of the data
live_search.shape

(422, 4)

In [None]:
#writting to CSV 
#live_search.to_csv('../datasets/live_search.csv', index = False)

## Sources 

[Previous Group Repository](https://github.com/DCapella/evacuation-routes)  
[GetOldTweets3 Documentation](https://pypi.org/project/GetOldTweets3/)  
[Tweepy Documentation](http://www.tweepy.org/)
