We will show how to use `Selenium` and `Tweepy` to get past the `3,200` Tweet limit that Twitter's API places on a user's timeline.

We use `Selenium` to open a browser and visit Twitter's search page. From the search results we collect the tweet IDs for a given user, then use `tweepy` to fetch the content of each tweet by its ID.


* Adapted from [Twitter Scraping](https://github.com/bpb27/twitter_scraping).
* The Authentication process follows [Intro_Collecting_Tweets](https://github.com/Data4Democracy/assemble/blob/master/tutorials/Intro_Collecting_Tweets.ipynb).
* To get this working, you need to install [ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/).
  * For Ubuntu ([source](https://christopher.su/2015/selenium-chromedriver-ubuntu/)):

```
wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
 ```
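
Before launching Selenium it can help to confirm that the driver is actually reachable. The following is a minimal sketch, not part of the original tutorial; the helper name `find_chromedriver` is hypothetical, and it only checks that a binary called `chromedriver` is on the `PATH`, not that its version matches the installed Chrome.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver binary found on PATH, or None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; check the install steps above")
else:
    print("chromedriver found at", path)
```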

In [74]:
import json
import time
import datetime

from tweepy import API
from tweepy import OAuthHandler

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException 
from selenium.common.exceptions import StaleElementReferenceException

## Input Parameters

In [82]:
user = 'kdnuggets'

start = datetime.datetime(2017, 1, 15)  
end = datetime.datetime(2017, 1, 16)    

twitter_ids_filename = 'all_ids.json'
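
The scraping loop further down visits one search page per day, from `start` through `end` inclusive. As a rough illustration of what that means for the parameters above, the sketch below lists the daily windows; the generator name `daily_windows` is an assumption introduced here and does not appear in the original code.

```python
import datetime


def daily_windows(start, end):
    """Yield (since, until) pairs, one per day from start through end inclusive."""
    one_day = datetime.timedelta(days=1)
    day = start
    while day <= end:
        yield day, day + one_day
        day += one_day


windows = list(daily_windows(datetime.datetime(2017, 1, 15),
                             datetime.datetime(2017, 1, 16)))
for since, until in windows:
    print(since.strftime('%Y-%m-%d'), '->', until.strftime('%Y-%m-%d'))
```

With the values used in this notebook this yields two windows, which matches the two result lines printed by the scraping cell.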

## Authentication

In [76]:
from config import *

def get_twitter_auth():
    """Setup Twitter Authentication.
    
    Return: tweepy.OAuthHandler object
    """
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return auth
    
def get_twitter_client():
    """Setup Twitter API Client.
    
    Return: tweepy.API object
    """
    auth = get_twitter_auth()
    client = API(auth)
    return client

client = get_twitter_client()
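
The cell above expects a local `config` module that defines `consumer_key`, `consumer_secret`, `access_token` and `access_secret`. One possible way to supply those values without hard-coding them is to read them from environment variables. This is only a sketch: the `TWITTER_*` variable names and the `load_credentials` helper are assumptions made for illustration, not something the original tutorial prescribes.

```python
import os

# Names the notebook imports from config
REQUIRED = ('consumer_key', 'consumer_secret', 'access_token', 'access_secret')


def load_credentials(env=None):
    """Read the four credentials from TWITTER_<NAME> variables (hypothetical names)."""
    env = os.environ if env is None else env
    creds, missing = {}, []
    for name in REQUIRED:
        value = env.get('TWITTER_' + name.upper())
        if value:
            creds[name] = value
        else:
            missing.append(name)
    if missing:
        raise KeyError('missing credentials: ' + ', '.join(missing))
    return creds


# Demonstration with placeholder values instead of real secrets
demo_env = {'TWITTER_' + name.upper(): 'placeholder' for name in REQUIRED}
print(sorted(load_credentials(demo_env)))
```

A `config.py` built this way could end with `globals().update(load_credentials())` so that `from config import *` keeps working unchanged.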

## Helper Functions

In [77]:
def twitter_url(user, start, end):
    """Form url to access tweets via Twitter's search page.
    
    Return: string
    """
    url1 = 'https://twitter.com/search?f=tweets&q=from%3A'
    url2 = user + '%20since%3A' + start.strftime('%Y-%m-%d') 
    url3 = '%20until%3A' + end.strftime('%Y-%m-%d') + '%20include%3Aretweets&src=typd'
    return url1 + url2 + url3
    
def increment_day(date, i):
    """Increment day object by i days.
    
    Return: datetime object
    """
    return date + datetime.timedelta(days=i)
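
The hand-written `%3A` and `%20` sequences in `twitter_url` are just the percent-encoded forms of `:` and a space. As an illustrative alternative (the function name `build_search_url` is hypothetical), the same URL can be assembled from a readable query string with the standard library's `urllib.parse.quote`:

```python
import datetime
from urllib.parse import quote


def build_search_url(user, start, end):
    """Build the same search URL as twitter_url, letting quote() do the encoding."""
    query = 'from:{} since:{} until:{} include:retweets'.format(
        user, start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'))
    # safe='' so that ':' and ' ' are both percent-encoded
    return ('https://twitter.com/search?f=tweets&q='
            + quote(query, safe='') + '&src=typd')


print(build_search_url('kdnuggets',
                       datetime.datetime(2017, 1, 15),
                       datetime.datetime(2017, 1, 16)))
```

Keeping the query readable makes it easier to add or remove search operators later.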

## Get Tweet IDs

In [79]:
# Adapted from https://github.com/bpb27/twitter_scraping

delay = 1  # time to wait on each page load before reading the page
driver = webdriver.Chrome() 

tweet_selector = 'li.js-stream-item'
id_selector = '.time a.tweet-timestamp'

ids = list()
for day in range((end - start).days + 1):
    # Get Twitter search url
    startDate = increment_day(start, 0)
    endDate = increment_day(start, 1)
    url = twitter_url(user, startDate, endDate)

    driver.get(url)
    time.sleep(delay)

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment = 10

        # Scroll through the Twitter search page
        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            time.sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            increment += 10
        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

        # Get the IDs for all Tweets
        for tweet in found_tweets:
            try:
                tweet_id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-1]
                ids.append(tweet_id)
            except StaleElementReferenceException:
                print('lost element reference', tweet)

    except NoSuchElementException:
        print('no tweets on this day')

    start = increment_day(start, 1)

scrolling down to load more tweets
scrolling down to load more tweets
20 tweets found, 0 total
9 tweets found, 20 total
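
The scraping cell takes each tweet's ID to be the last path segment of its timestamp link. A slightly more defensive version of that step might look like the sketch below; `tweet_id_from_href` is a hypothetical helper, and it assumes permalinks of the form `.../status/<digits>`, as in the output of this notebook.

```python
def tweet_id_from_href(href):
    """Return the trailing numeric ID of a status permalink, or None if absent."""
    # Drop any query string and trailing slash before taking the last segment
    path = href.split('?', 1)[0].rstrip('/')
    last = path.split('/')[-1]
    return last if last.isdigit() else None


print(tweet_id_from_href('https://twitter.com/kdnuggets/status/820722556936351744'))
```

Returning `None` for links that do not end in digits lets the caller skip them instead of saving a bad ID.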


## Save IDs

In [80]:
# Adapted from https://github.com/bpb27/twitter_scraping

try:
    # Merge with IDs saved by earlier runs, if any
    with open(twitter_ids_filename) as f:
        all_ids = ids + json.load(f)
except FileNotFoundError:
    # First run: nothing has been saved yet
    all_ids = ids

# Drop duplicates before writing
data_to_write = list(set(all_ids))
print('tweets found on this scrape: ', len(ids))
print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

print('Tweets Scraped!')
driver.close()

tweets found on this scrape:  29
total tweet count:  29
Tweets Scraped!
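
Deduplicating through `set` discards the order in which the IDs were found, which is why the IDs come back shuffled when the file is read again. If a stable order matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each ID. This is a sketch only; `merge_ids` is a hypothetical helper, not part of the original notebook.

```python
def merge_ids(new_ids, saved_ids):
    """Combine two ID lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(list(new_ids) + list(saved_ids)))


print(merge_ids(['3', '1'], ['1', '2']))
```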


## Get Tweet Info

In [81]:
with open(twitter_ids_filename) as f:
    ids = json.load(f)

# Look up each saved ID through the API and print its text
for tweetId in ids:
    print(tweetId)
    tweet = client.get_status(tweetId)
    print(tweet.text)

820722556936351744
How Can Lean Six Sigma Help #MachineLearning? #KDN https://t.co/iPQJiQQhEz
820741858326409224
Solid Collection of "Top #MachineLearning Books" https://t.co/Dw4rL7ZzVF https://t.co/TC0PYFf8yK
821019653698940928
#ICYMI Game Theory Reveals the Future of #DeepLearning https://t.co/Dgq1dJI96h https://t.co/pB6hcmeG1c
820725725359652864
How IBM Is Using #ArtificialIntelligence to Provide #Cybersecurity https://t.co/b15TylAUrH #AI https://t.co/RJhBqxFTyc
820642932084637698
Exclusive Interview with top #DataScientist @Jeremyphoward on #DeepLearning, @Kaggle, #DataScience, and more… https://t.co/yUwleeOu0h
820767777808125952
What is the Role of the Activation Function in a Neural Network? #KDN https://t.co/jxzx8QChFl
821024768375869442
The Best Metric to Measure Accuracy of Classification Models #KDN https://t.co/9npH84wJ4S
821012639102996480
Poker Play Begins in "Brains Vs. AI: Upping the Ante" | Carnegie Mellon School of Computer Science… https://t.co/xwSoyPvXki
820769790860