To access Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is to register your app. Point your browser to http://apps.twitter.com, log in to Twitter (if you’re not already logged in) and register a new application. You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Like the consumer keys, these strings must be kept private: they give the application access to Twitter on behalf of your account.
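One common way to keep these four strings out of the source code is to read them from environment variables. The following is a minimal sketch under that assumption; the function name `load_credentials` and the variable names such as `TWITTER_CONSUMER_KEY` are hypothetical and are not defined by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; any names will do.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}

def load_credentials(env=None):
    """Return the four credentials as a dict, read from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Collect every variable that is absent or empty so the error names them all
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned dictionary can then supply the values that would otherwise be typed into the notebook as string literals.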

## Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There are also several Python-based clients that we can use instead of re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use.

In [1]:
# In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface
import tweepy
from tweepy import OAuthHandler
 
# Replace the placeholders with your own credentials; never publish the real values
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

In [6]:
# We can read our own timeline (i.e. our Twitter homepage)
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

What's happening? #Hello
test


In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.
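Because `_json` is a plain dictionary, individual fields can be read with ordinary key lookups. A small sketch, assuming a hypothetical helper named `summarize_tweet` and a hand-trimmed sample dictionary that stands in for a real response:

```python
def summarize_tweet(tweet):
    """Pick a few fields out of a tweet dictionary (the content of status._json)."""
    # Hashtags live under entities -> hashtags; fall back to an empty list if absent
    hashtags = [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]
    return {
        "id": tweet["id_str"],
        "author": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "hashtags": hashtags,
    }

# A trimmed-down dictionary with the same shape as a real response
sample = {
    "id_str": "1050022611181285381",
    "text": "What's happening? #Hello",
    "entities": {"hashtags": [{"text": "Hello", "indices": [18, 24]}]},
    "user": {"screen_name": "KB03809193"},
}

print(summarize_tweet(sample))
```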

In [13]:
import json

def process_or_store(tweet):
    print(json.dumps(tweet))

In [14]:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)

{"created_at": "Wed Oct 10 13:57:53 +0000 2018", "id": 1050022611181285381, "id_str": "1050022611181285381", "text": "What's happening? #Hello", "truncated": false, "entities": {"hashtags": [{"text": "Hello", "indices": [18, 24]}], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 1047459232310878208, "id_str": "1047459232310878208", "name": "KB", "screen_name": "KB03809193", "location": "", "description": "", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 0, "friends_count": 0, "listed_count": 0, "created_at": "Wed Oct 03 12:11:56 +0000 2018", "favourites_count": 0, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": false, "statuses_count": 2, "lang": "en", "con

In [15]:
# List all the accounts we follow ("friends" in Twitter's terminology)
for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)

In [16]:
# List all of our own tweets
for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

{"created_at": "Wed Oct 10 13:57:53 +0000 2018", "id": 1050022611181285381, "id_str": "1050022611181285381", "text": "What's happening? #Hello", "truncated": false, "entities": {"hashtags": [{"text": "Hello", "indices": [18, 24]}], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 1047459232310878208, "id_str": "1047459232310878208", "name": "KB", "screen_name": "KB03809193", "location": "", "description": "", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 0, "friends_count": 0, "listed_count": 0, "created_at": "Wed Oct 03 12:11:56 +0000 2018", "favourites_count": 0, "utc_offset": null, "time_zone": null, "geo_enabled": false, "verified": false, "statuses_count": 2, "lang": "en", "con
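The `process_or_store` function used above only prints each tweet. A storing variant might append every tweet to a file as one line of JSON (the "JSON Lines" layout). This is a sketch under that assumption; the name `store_tweet` and the default file name `tweets.jsonl` are hypothetical.

```python
import json

def store_tweet(tweet, path="tweets.jsonl"):
    """Append one tweet dictionary to `path` as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        # json.dumps never emits raw newlines, so each tweet stays on one line
        f.write(json.dumps(tweet) + "\n")
```

Keeping one tweet per line means the file can later be processed line by line without loading everything into memory.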

## Streaming

When we want to “keep the connection open” and gather all the upcoming tweets about a particular event, the streaming API is what we need. We extend StreamListener to customise the way the incoming data is processed. The following example gathers all the new tweets with the #python hashtag and runs until it is interrupted:

In [17]:
from tweepy import Stream
from tweepy.streaming import StreamListener
 
class MyListener(StreamListener):
 
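    def on_data(self, data):
        # Append each raw JSON message to python.json
        try:
            with open('python.json', 'a') as f:
                f.write(data)
        except Exception as e:
            # Report the problem but keep the stream alive; Exception (unlike
            # BaseException) does not swallow KeyboardInterrupt
            print("Error on_data: %s" % str(e))
        return True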
 
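    def on_error(self, status):
        print(status)
        # A 420 means we are being rate limited: returning False disconnects
        # the stream instead of retrying
        if status == 420:
            return False
        return True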
 
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

KeyboardInterrupt: 
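Once the stream has been stopped, the collected file can be read back one message at a time. A minimal sketch, assuming the file holds one JSON object per line and may contain blank lines between records; the helper name `read_tweets` is hypothetical.

```python
import json

def read_tweets(path):
    """Yield one dictionary per non-empty line of a file of JSON messages."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            yield json.loads(line)
```

For example, `for tweet in read_tweets('python.json'): print(tweet.get('text'))` would print the text of each stored message; `.get` is used because not every message in a stream is necessarily a tweet with a `text` field.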