Twitter Client

benoitbleuze edited this page Sep 11, 2012 · 16 revisions

Welcome to part 2 of the OpenTechSchool Python Beginner series. In this part we will learn how to use Python to retrieve data from the internet.

Tools

If you participated in part 1 then you already have everything you need. This tutorial only requires an installation of Python (we use 2.7) and a web browser. You will not need a Twitter account.

Understanding data structures

Back in the first session we introduced three of the most common data types used in programming: numbers, strings and booleans. We assigned those data types to variables one-by-one, like so:

>>> x = 3          # numbers
>>> a = "gorillas" # strings
>>> t = True       # booleans

But what if we need something more complicated, like a shopping list? Assigning a variable for every item in the list would make things very complicated:

>>> item_1 = "milk"
>>> item_2 = "cheese"
>>> item_3 = "bread"

Lists

Fortunately we don't have to do this. Instead, we have the list data type. An empty list is simply []:

>>> shopping_list = []

When you are in the Python interpreter you can see what is inside a list by just typing the name of the list. For example:

>>> shopping_list
[]

The interpreter shows us that the list is empty.

Now we can add items to shopping_list. Try typing the following commands into the Python interpreter.

>>> shopping_list.append("milk")
>>> shopping_list.append("cheese")
>>> shopping_list.append("bread")
Exercise: What is in the shopping list? What happens when you append numbers or booleans to the list?

To remove an item from the list we use remove():

>>> shopping_list.remove("milk")

Lists can easily be processed in a for loop. Have a look at this example which prints each item of the list in a new row:

>>> for item in shopping_list:
...     print(item)

And that's it! Lists are the most common data structure in programming. There are lots of other things you can do with lists, and all languages have their own subtly different interpretation. But fundamentally they are all very similar.
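
As a small taste of those other things, here is a sketch of a few more list operations. The list name fruits and its contents are made up for illustration.

```python
# A made-up list, just for illustration
fruits = ["apple", "banana", "cherry"]

number_of_fruits = len(fruits)  # how many items are in the list
first_fruit = fruits[0]         # items are numbered starting from 0
last_fruit = fruits[-1]         # negative numbers count from the end

fruits.sort(reverse=True)       # reorder the list in place, from Z to A

print(number_of_fruits)
print(first_fruit)
print(last_fruit)
print(fruits)
```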

In summary:

>>> shopping_list = []
>>> shopping_list.append("cookies")
>>> shopping_list.remove("cookies")

Dictionaries

The other main data type is the dictionary. The dictionary allows you to associate one piece of data with another. The analogy comes from real-life dictionaries, where we associate a word with its meaning. It's a little harder to understand than a list, but Python makes dictionaries very easy to deal with.

You can create a dictionary with {}:

>>> foods = {}

And you can add items to the dictionary like this:

>>> foods["banana"] = "A delicious and tasty treat!"
>>> foods["dirt"]   = "Not delicious. Not tasty. DO NOT EAT!"

As with lists, you can always see what is inside a dictionary:

>>> foods
{'banana': 'A delicious and tasty treat!', 'dirt': 'Not delicious. Not tasty. DO NOT EAT!'}

You can also delete from a dictionary. We don't really need to include an entry for dirt:

>>> del foods["dirt"]
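
Reading a value back uses the same square brackets. The sketch below starts from a fresh foods dictionary. The get() method and the fallback text are additions that are not used elsewhere in this tutorial.

```python
foods = {}
foods["banana"] = "A delicious and tasty treat!"

# Look up the value stored under a key
banana_description = foods["banana"]
print(banana_description)

# foods["dirt"] would raise a KeyError because that key does not exist.
# get() returns a fallback value instead of raising an error.
dirt_description = foods.get("dirt", "No entry found")
print(dirt_description)
```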

What makes dictionaries so useful is that we can give meaning to the items within them. A list is just a bag of things, but a dictionary is a specific mapping of something to something else. By combining lists and dictionaries you can describe basically any data structure used in computing.

For example, you can easily add a list to a dictionary:

>>> ingredients = {}
>>> ingredients["blt sandwich"] = ["bread", "lettuce", "tomato", "bacon"]

Or add dictionaries to lists:

>>> europe = []
>>> germany = {"name": "Germany", "population": 81000000}
>>> europe.append(germany)
>>> luxembourg = {"name": "Luxembourg", "population": 512000}
>>> europe.append(luxembourg)
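
Once the data is nested like this, a for loop can walk through the list and look up values in each dictionary. The following sketch rebuilds the europe list in one step and adds up the populations; the variable name total_population is made up.

```python
europe = [
    {"name": "Germany", "population": 81000000},
    {"name": "Luxembourg", "population": 512000},
]

total_population = 0
for country in europe:
    # Each item in the list is a dictionary, so we look up values by key.
    # str() turns the number into a string so it can be joined with "+".
    print(country["name"] + ": " + str(country["population"]))
    total_population += country["population"]

print(total_population)
```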

Outside of Python, dictionaries are often called hash tables, hash maps or just maps.

How Twitter uses lists and dictionaries

Now we are going to see how popular websites such as Twitter organise their data into lists and dictionaries. The data can get a little complicated, but over time you will learn to filter out the information that you don't need.

Let's start with a look at a Twitter search. Twitter is easy because it returns data in the JSON format, which has very similar syntax to Python. Open the following URL in your browser:

http://search.twitter.com/search.json?q=python&rpp=1

This search consists of a base part http://search.twitter.com/search.json?, a required parameter q (query) with value python and an optional parameter rpp (results per page) with value 1. But note that sometimes it doesn't return any results and you might have to refresh.

To any reasonable person the result looks like garbage. It's a mess of syntax and quotation marks, optimised by Twitter to reduce the amount of data sent over the internet. The data is not very readable, but it actually contains one tweet related to the keyword python. You can copy and paste the data into jsonlint.com to view it.

http://jsonlint.com/

OK, now it should look a little better. Notice how the JSON format uses a similar syntax to Python dictionaries and lists. The whole result is just one big dictionary. Twitter returns the original query under the key query, and results is a list of tweets.

Exercise: Change the rpp parameter and q parameter to different values and see how the amount of data varies. Paste it into jsonlint.com to see what kind of things people are tweeting about.

If you want to learn more about the Twitter Search API then you can browse the documentation online, though it isn't necessary for the rest of the tutorial.

Getting data into Python

Now that we know where the data is, it's time to get that into Python so we can play around with it. We are going to use the urllib2 module to retrieve the data, then the json module to convert JSON dictionaries and lists into Python dictionaries and lists.

Let's start with retrieving the data:

>>> import urllib2
>>> response = urllib2.urlopen('http://search.twitter.com/search.json?q=python&rpp=1')
>>> raw_data = response.read()
>>> print(raw_data)

After importing urllib2, we open the query URL and store it in the variable response.

The response contains a lot of behind-the-scenes information, but we only need the data, so we call the read() method and store it in raw_data.

Finally we print the data, which will look very similar to what you have seen in the browser, i.e. super ugly. It's just one big string at this stage.

If you want to learn more about urllib2 then you can browse the documentation online.

Pythonize it!

Twitter uses the JSON notation to format the response. Fortunately the Python standard library contains a JSON parser which does all the work for us. After importing it, we can use json.loads() to convert a JSON string into a data structure consisting of lists and dictionaries.

>>> import json
>>> data = json.loads(raw_data)

Now we have a variable called data which we can play with:

>>> print(data['query'])

Examine the keys of the dictionary. (There is a prettier version of the print() function called pprint(). You can import it with from pprint import pprint.) The key query contains the query string we sent to Twitter: python. But much more interesting is results. It contains a list of tweets which matched our query. Its length should be equal to the value of the rpp parameter in the query.

Each tweet is again a dictionary containing various information about it and of course the message itself.
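
If you would like to explore this structure without waiting for a search to return results, a hand-written JSON string works just as well. The sample below is made up: it only imitates the query and results keys and the from_user and text fields mentioned in this tutorial, and a real response contains many more keys.

```python
import json

# Made-up sample data that imitates the shape of a search response
raw_data = """
{
    "query": "python",
    "results": [
        {"from_user": "alice", "text": "Learning #python today"},
        {"from_user": "bob", "text": "Python is fun http://example.com"}
    ]
}
"""

data = json.loads(raw_data)

query = data["query"]           # the original search term
tweets = data["results"]        # a list of dictionaries, one per tweet
number_of_tweets = len(tweets)

first_tweet = tweets[0]
first_tweet_keys = sorted(first_tweet.keys())  # the keys of one tweet, in alphabetical order

print(query)
print(number_of_tweets)
print(first_tweet_keys)
```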

Now how about something a little fancier? Let's print these tweets:

>>> tweets = data['results']
>>> for tweet in tweets:
...     print(tweet['text'])

Oh wait, we only asked for one tweet! That isn't very useful.

Exercise: Print the username (from_user) from each tweet followed by the message. You'll need to increase the number of tweets returned if you want more than 1. You can join strings together by using '+', like this:
>>> favourite_elephant = "Mae Perm"
>>> print("My favourite elephant is " + favourite_elephant)
My favourite elephant is Mae Perm

If you would like to learn more about JSON and how to use it in Python then you can browse the documentation online.

That's it!

Now you can make Twitter searches in Python! But where to go from here? Perhaps you have a few things that you would like to explore, or you would like to browse the documentation we linked to earlier. If you have questions for the coaches, for example about their own experiences with APIs like Twitter's, we will be happy to answer them. We also have a few ideas below that you might like to try.

Reusable Code

You don't want to type these commands into the interpreter every time, so let's create a function for this. Open a new file twitter.py and paste this code block into it.

import json
import urllib2

base_url = "http://search.twitter.com/search.json?"

def fetch_tweets(query, rpp):
    # str() allows rpp to be passed as a number or as a string
    url = base_url + 'q=' + query + '&rpp=' + str(rpp)
    response = urllib2.urlopen(url)
    raw_data = response.read()
    data = json.loads(raw_data)
    return data['results']

Now you can use the fetch_tweets() function to get different sets of tweets to experiment with. You may also paste the code directly into the Python interpreter to make the function available there.
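
Search terms that contain spaces or symbols such as # need to be escaped before they can go into a URL. One way to do that, sketched here, is the standard library's urlencode() function. The helper name build_search_url is made up, and the try/except import is only there so the sketch also runs on Python 3, where the function lives in urllib.parse.

```python
try:
    from urllib import urlencode        # Python 2.7, as used in this tutorial
except ImportError:
    from urllib.parse import urlencode  # Python 3

base_url = "http://search.twitter.com/search.json?"

def build_search_url(query, rpp):
    # A list of pairs keeps the parameters in a predictable order.
    # urlencode() escapes special characters and converts numbers to text.
    return base_url + urlencode([("q", query), ("rpp", rpp)])

simple_url = build_search_url("python", 1)
escaped_url = build_search_url("monty python #fun", 5)

print(simple_url)
print(escaped_url)
```

The first URL matches the one used throughout this tutorial. In the second, the spaces and the # sign are replaced by escape sequences.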

The containment operator

Let's start to play with the data we got. You might have already noticed that people share links in their tweets. But how often? To find this out, we need a new operator: in. You already know ==, < and > from the first session. They return True or False depending on whether the condition is met. The in operator returns True if something is contained in a list or a string. Therefore it's also called a containment operator.

>>> shopping_list = ["bread", "milk", "butter"]
>>> "milk" in shopping_list
True
>>> "fun" in "Python is fun"
True

Knowing that, we can loop over the tweets and count how many contain a link by checking if the text contains http:

>>> num_links = 0
>>> for tweet in tweets:
...     if "http" in tweet["text"]:
...         num_links += 1

There might be other interesting words to count. Play around with this and maybe write a function that accepts a list of tweets and a string as arguments and returns how often this string occurs.
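
One possible shape for such a function is sketched below. The name count_tweets_containing and the sample tweets are made up for illustration; with real data you would pass in the list returned by the search. Note that this version counts tweets that contain the string at least once, not every single occurrence.

```python
def count_tweets_containing(tweets, word):
    # Count how many tweets contain the given string at least once
    count = 0
    for tweet in tweets:
        if word in tweet["text"]:
            count += 1
    return count

# Made-up tweets in the same shape as the search results
sample_tweets = [
    {"text": "Learning Python http://example.com"},
    {"text": "Python is fun"},
    {"text": "Time for lunch"},
]

links = count_tweets_containing(sample_tweets, "http")
pythons = count_tweets_containing(sample_tweets, "Python")
print(links)
print(pythons)
```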

Handling strings

Python has some functions that help with handling strings. You can find a complete list in the documentation, but for now the following will be sufficient:

>>> s = "Python is fun"
>>> s.startswith("Py")  # Return True if the string starts with the given string
True
>>> s.lower()  # Converts all letters to lower case
'python is fun'
>>> s.count("n")  # Count the number of occurrences of the given string
2
>>> s.split(" ")  # Return a list of substrings using the given delimiter
['Python', 'is', 'fun']

What could we do with these? Hashtags are a very important aspect of Twitter. They are words prefixed with a # sign and are used to group tweets by topic. Let's find all hashtags!

>>> all_words = []
>>> for tweet in tweets:
...     words = tweet["text"].split(" ")
...     for word in words:
...         all_words.append(word)
...
>>> hashtags = []
>>> for word in all_words:
...     if word.startswith("#"):
...         hashtags.append(word)
...
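
Splitting on a single space leaves punctuation attached, so "#python!" and "#python" would be treated as different hashtags. The sketch below is one way to tidy that up. The function name find_hashtags, the sample tweets and the choice of punctuation to strip are all assumptions.

```python
def find_hashtags(tweets):
    hashtags = []
    for tweet in tweets:
        # split() without an argument splits on any whitespace,
        # including line breaks and repeated spaces
        for word in tweet["text"].split():
            if word.startswith("#"):
                # Remove common punctuation from the end of the word
                hashtags.append(word.rstrip(".,!?:;"))
    return hashtags

# Made-up tweets in the same shape as the search results
sample_tweets = [
    {"text": "I love #python!"},
    {"text": "#python and #django,\nwhat a team"},
]

found = find_hashtags(sample_tweets)
print(found)
```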

Some hashtags are surely used more often than others. What about some statistics?

>>> hashtag_stats = {}
>>> for hashtag in hashtags:
...     if hashtag in hashtag_stats:
...         old_value = hashtag_stats[hashtag]
...         hashtag_stats[hashtag] = old_value + 1
...     else:
...         hashtag_stats[hashtag] = 1
Exercise: Did you notice that hashtags occur multiple times in our statistics when they are written in lowercase, uppercase or a mix of both? Use the lower() method to make them equal.
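
Counting things is so common that the standard library has a ready-made tool for it. As an alternative sketch, not the approach used above, collections.Counter builds the same kind of statistics in one step. The hashtags in the sample list are made up.

```python
from collections import Counter

# Made-up hashtags standing in for the list collected above
hashtags = ["#python", "#django", "#python", "#pycon", "#python", "#django"]

# Counter behaves like a dictionary that maps each item to its count
hashtag_stats = Counter(hashtags)
python_count = hashtag_stats["#python"]

# most_common(n) returns the n most frequent items together with their counts
top_two = hashtag_stats.most_common(2)

print(python_count)
print(top_two)
```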

Storing files from the web

It's quite common to store data from the web on the local disc. To do this, we need Python's file functions. To open a file for writing, use open("filename.txt", "w"). If the first argument is just a file name, Python uses the current directory (the directory where you invoked the python command). The second argument "w" indicates that you want to write to the file. If you want to read it, use "r" instead. After you have written or read the file, you have to close it.

>>> f = open("test.txt", "w")
>>> f.write("Just a few words")
>>> f.close()
>>> f = open("test.txt", "r")
>>> text = f.read()
>>> f.close()
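
It is easy to forget the call to close(). Python's with statement, sketched below, closes the file automatically when the indented block ends. The file is again called test.txt, but it is placed in the system's temporary directory here so that the sketch does not clutter your working directory.

```python
import os
import tempfile

# Build a path to test.txt inside the temporary directory;
# the exact directory depends on your system
path = os.path.join(tempfile.gettempdir(), "test.txt")

# "w" opens the file for writing; the file is closed when the block ends
with open(path, "w") as f:
    f.write("Just a few words")

# "r" opens the file for reading
with open(path, "r") as f:
    text = f.read()

print(text)
```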

We can use this to store the profile image of a tweet's author on our disc. Load the image's URL with urllib2.urlopen(), read the data and write it to a file opened in binary write mode ("wb"). The "b" tells Python to write the bytes exactly as they are, which matters for images on some systems such as Windows.

>>> url = "https://si0.twimg.com/profile_images/2434435324/6dam4pvl7agywxsxwc5c_normal.png"
>>> response = urllib2.urlopen(url)
>>> raw_data = response.read()
>>> image = open("profile_image.png", "wb")
>>> image.write(raw_data)
>>> image.close()

Next steps

There are thousands of APIs available over the internet. Many of them aren't as simple as the Twitter API (there are lots of different data formats these days) but most companies take care to make it as easy as possible. You can find a catalog of them at ProgrammableWeb. They also have a collection of "mashups" where people combine multiple APIs together in new and interesting ways.

If you would like to try something a little more challenging then the Twitter Streaming API might be just the thing for you. In the Search API you send a request and Twitter returns a response. But in the Streaming API you send a request and Twitter sends back an endless stream of tweets. This presents some challenges because you can't just call read() to get all the tweets. This tutorial from Ars Technica might be worth a look.
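
The general idea can be sketched without a network connection. The example below assumes, purely for illustration, that each line of the stream holds one JSON-encoded tweet, and it uses an in-memory file from the io module as a stand-in for the connection. Check the Streaming API documentation for the real format.

```python
import io
import json

# Stand-in for a network connection: made-up data,
# with one JSON-encoded tweet per line (an assumption)
fake_stream = io.BytesIO(
    b'{"text": "first tweet"}\n'
    b'{"text": "second tweet"}\n'
    b'{"text": "third tweet"}\n'
)

texts = []
# Looping over a file-like object yields one line at a time,
# so each tweet can be handled without reading everything first
for line in fake_stream:
    tweet = json.loads(line.decode("utf-8"))
    texts.append(tweet["text"])
    print(tweet["text"])
```

Because the loop handles one line at a time, it never needs the whole stream in memory, which is what makes this approach suitable for data that never ends.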