# Chapter 7: More on loops

In the previous chapters, we have often used the powerful concept of looping in Python. Using loops, we can easily repeat certain actions when coding. With for loops, for instance, it is really easy to visit the items in a list and print them. In this chapter, we will discuss some more advanced forms of looping, as well as new, quick ways to create and deal with lists and other iterable data sequences.

### Range

The first new function that we will discuss here is `range()`. Using this function, we can quickly generate a sequence of numbers in a specific range:

In [None]:
for i in range(10):
    print(i)

Here, `range()` will return a number of integers, starting from zero, up to (but not including) the number which we pass as an argument to the function. Using `range()` is of course a much more convenient way to generate such sequences of numbers than writing, e.g., a while loop to achieve the same result. Note that we can pass more than one argument to `range()` if we want to start counting from a number higher than zero (zero being the default when you only pass a single argument to the function):

In [None]:
for i in range(300, 306):
    print(i)

We can even specify a 'step size' as a third argument, which controls how much a variable will increase with each step:

In [None]:
for i in range(15, 26, 3):
    print(i)

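If you don't specify the step size explicitly, it will default to 1. Note that `range()` does not return an actual list but a special `range` object. If you want to print the numbers it contains, or if you need a real list, you have to convert the result explicitly, for instance, to a list: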

In [None]:
numbers = list(range(10))
print(numbers[3:])
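
As a small supplementary sketch (the variable names `r` and `countdown` are only illustrative), the cell below shows what a `range` object looks like before it is converted, and that a negative step size makes `range()` count downwards:

```python
# range() returns a lazy range object, not a list
r = range(15, 26, 3)
print(r)             # prints the range object itself: range(15, 26, 3)
print(list(r))       # [15, 18, 21, 24]

# A negative step counts down; the stop value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)     # [5, 4, 3, 2, 1]

# A range knows its length and supports indexing without conversion
print(len(r), r[1])  # 4 18
```

Because the numbers are only produced when they are needed, even a very large `range()` takes up hardly any memory until you convert it to a list.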

### Enumerate

Of course, `range()` can also be used to iterate over the items in a list or tuple, typically in combination with calling `len()` to avoid `IndexErrors`: 

In [None]:
words = "Be yourself; everyone else is already taken".split()
for i in range(len(words)):
    print(words[i])

Naturally, the same result can more easily be obtained by looping over `words` directly:

In [None]:
for word in words:
    print(word)

One drawback of such an easy-to-write loop, however, is that it doesn't keep track of the index of the word that we are printing in each iteration. Suppose that we would like to print the index of each word in our example above. We would then have to work with a counter...

In [None]:
counter = 0
for word in words:
    print(word, ": index", counter)
    counter += 1

... or indeed use a call to `range()` and `len()`:

In [None]:
for i in range(len(words)):
    print(words[i], ": index", i)      

A function that makes life in Python much easier in this respect is `enumerate()`. If we pass a list to `enumerate()`, it will produce a series of mini-tuples: each mini-tuple contains the index of an item as its first element, and the actual item as its second element:

In [None]:
print(list(enumerate(words)))

Here -- as with `range()` -- we have to cast the result of `enumerate()` to e.g. a list before we can actually print it. Iterating over the result of `enumerate()`, on the other hand, is not a problem. Here, we print out each mini-tuple, consisting of an index and an item in a for loop:

In [None]:
for mini_tuple in enumerate(words):
    print(mini_tuple)

When using such for loops and `enumerate()`, we can do something really cool. Remember that we can 'unpack' tuples with multiple assignment: if a tuple consists of two elements, we can unpack it on one line of code to two different variables via the assignment operator: 

In [None]:
item = (5, 'already')
index, word = item # this is the same as: index, word = (5, "already")
print(index)
print(word)

In our for loop example, we can apply the same kind of unpacking in each iteration:

In [None]:
for item in enumerate(words):
    index, word = item
    print(index)
    print(word)
    print("=======")

However, there is also a super-convenient shortcut for this in Python, where we unpack each item in the for-statement already:

In [None]:
for index, word in enumerate(words):
    print(index)
    print(word)
    print("====")

How cool is that? Note how easy it becomes now, to solve our problem with the index above:

In [None]:
for i, word in enumerate(words):
    print(word, ": index", i)
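
If you would rather start counting at a number other than zero, `enumerate()` accepts an optional `start` argument. The following is a minimal sketch of that option; the names `numbered` and `position` are only illustrative:

```python
words = "Be yourself; everyone else is already taken".split()

# start=1 makes the counter begin at 1 instead of the default 0
numbered = list(enumerate(words, start=1))
print(numbered[:2])  # [(1, 'Be'), (2, 'yourself;')]

for position, word in enumerate(words, start=1):
    print(position, word)
```

Note that `start` only changes the numbers that are handed out: the items themselves are still visited from the first to the last.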

#### DIY 1
Let's put that into practice. First, extract the lines from the file `f` and put them in the variable `lines`. Then, loop over `lines` and print the line number followed by the line itself. Remember that `enumerate()` uses zero-based indexing...

In [None]:
import codecs
f = codecs.open("data/austen-emma-excerpt.txt", "r", "utf-8")

### Zip

Obviously, `enumerate()` can be really useful when you're working with lists or other kinds of data sequences. Another helpful function in this respect is `zip()`. Suppose that we have a small database of 5 books in the form of three lists: the first list contains the titles of the books, the second the authors, and the third the dates of publication:

In [None]:
titles = ["Emma", "Stoner", "Inferno", "1984", "Aeneid"]
authors = ["J. Austen", "J. Williams", "D. Alighieri", "G. Orwell", "P. Vergilius"]
dates = ["1815", "2006", "Ca. 1321", "1949", "before 19 BC"]

In each of these lists, the third item always corresponds to Dante's masterpiece and the last item to the Aeneid by Vergil, which inspired him. The use of `zip()` can now easily be illustrated:

In [None]:
print(list(zip(titles, authors)))
print(list(zip(titles, dates)))
print(list(zip(authors, dates)))

Do you see what happened here? In fact, `zip()` really functions like a 'zipper' in the real world: it zips together multiple lists and returns a series of mini-tuples, in which the correct authors, titles and dates are combined with each other. Moreover, you can pass more than two sequences at once to `zip()`:

In [None]:
print(list(zip(authors, titles, dates)))

How awesome is that? Here too: don't forget to cast the result of `zip()` to a list or tuple, e.g. if you want to print it. As with `enumerate()`, we can also unpack each mini-tuple directly when declaring a for loop:

In [None]:
for author, title in zip(authors, titles):
    print(author)
    print(title)
    print("===")

As you can understand, this is really useful functionality for dealing with long, complex lists and especially combinations of them.
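
One thing to keep in mind is what happens when the lists do not have the same length. The sketch below uses two deliberately mismatched toy lists (`short_titles` and `short_authors` are made-up names, so that the lists defined earlier stay untouched) and also shows `zip_longest()` from the standard `itertools` module as one possible alternative:

```python
from itertools import zip_longest

short_titles = ["Emma", "Stoner", "Inferno"]
short_authors = ["J. Austen", "J. Williams"]  # deliberately one item short

# zip() silently stops as soon as the shortest input runs out
pairs = list(zip(short_titles, short_authors))
print(pairs)       # [('Emma', 'J. Austen'), ('Stoner', 'J. Williams')]

# zip_longest() keeps going and pads the missing values with fillvalue
padded = list(zip_longest(short_titles, short_authors, fillvalue="unknown"))
print(padded[-1])  # ('Inferno', 'unknown')
```

Since `zip()` gives no warning when items are dropped, it can be worth checking with `len()` that your lists line up before zipping them.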

#### DIY 2
Suppose you want to make a `rot13` dictionary for the Caesar cipher (which maps each character to the character 13 places up or down the alphabet). Can you try making this dictionary using `letters1` and `letters2` and `zip()`?


In [None]:
letters1 = list("abcdefghijklm")
letters2 = list("nopqrstuvwxyz")
rot13 = {}

## Your code goes here


## Test
message = "pbatenghyngvbaf, lbh oebxr gur pbqr!"
decrypted_message = "".join(rot13.get(l, l) for l in message)
# FYI: the .get dictionary method looks up the first argument as a key, and returns the second argument if that key is not in the dictionary (instead of raising a KeyError)
print(decrypted_message)
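
In case the `.get()` method used in the test is new to you, here is a minimal sketch of how it behaves. The `plural` dictionary is a made-up toy example and has nothing to do with the solution of the exercise:

```python
# Hypothetical toy mapping, unrelated to the rot13 exercise
plural = {"mouse": "mice", "goose": "geese"}

print(plural.get("mouse", "mouse"))  # key present: the stored value, 'mice'
print(plural.get("cat", "cat"))      # key absent: the fallback, 'cat'
print(plural.get("cat"))             # without a fallback, .get() returns None

# Same idiom as in the test above: unknown items are passed through unchanged
converted = [plural.get(w, w) for w in ["mouse", "cat", "goose"]]
print(converted)  # ['mice', 'cat', 'geese']
```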

### Bonus material: comprehensions

Lists and for loops are used all the time in programming. If you are interested, let's have a look at a more concise and easy way to create and fill new lists in Python: _list comprehensions_. They are also often used to change one list into another. Typically, comprehensions can be written in a single line of Python code, which is why people often feel like they are more readable than normal Python for loops. Let's start with an example. Say that we would like to fill a list of numbers that represent the length of each word in a sentence, but only if that word isn't a punctuation mark. By now, we can of course easily create such a list using a for loop:

In [None]:
import string
words = "I have not failed . I’ve just found 10,000 ways that won’t work .".split()
word_lengths = []
for word in words:
    if word not in string.punctuation:
        word_lengths.append(len(word))
print(word_lengths)

We can create the exact same list of numbers using a list comprehension which only takes up one line of Python code:

In [None]:
word_lengths = [len(word) for word in words if word not in string.punctuation]
print(word_lengths)

OK, impressive, but there are a lot of new things going on here. Let's go through this step by step. The first step is easy: we initialize a variable `word_lengths` to which we assign a value using the assignment operator. The type of that value will eventually be a list: this is indicated by the square brackets which enclose the list comprehension:

In [None]:
print(type(word_lengths))

Inside the square brackets, we can find the actual comprehension which will determine what goes inside our new list. Note that it is not always possible to read these comprehensions from left to right, so you will have to get used to the way they are built up from a syntactic point of view. First of all, all the way on the left, we add an expression that determines which elements will make it into our list, in this case: `len(word)`. The variable `word`, in this case, is generated by the following for-statement: `for word in words`. Finally, we add a condition to our statement that will determine whether or not `len(word)` should be added to our list. In this case, `len(word)` will only be included in our list if the word is not a punctuation mark: `if word not in string.punctuation`. This is a full list comprehension, but simpler ones exist. We could, for instance, have left out the call to `len()` and added `word` itself to our list. In this way, we could, for example, easily remove all punctuation from our word list:

In [None]:
words_without_punc = [word for word in words if word not in string.punctuation]
print(words_without_punc)

Moreover, we don't have to include the if-statement at the end (it is always optional):

In [None]:
all_word_lengths = [len(word) for word in words]
print(all_word_lengths)

In the comprehensions above, `words` is the only pre-existing input to our comprehension; all the other variables are created and manipulated inside the comprehension. The new `range()` function which we saw at the beginning of this chapter is also often used as the input for a comprehension:

In [None]:
square_numbers = [x*x for x in range(10)]
print(square_numbers)
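
To recap the anatomy of a comprehension, the sketch below puts a for loop and its comprehension counterpart side by side. The example itself (squares of the even numbers below 10) and the variable names are only illustrative:

```python
# General shape: [expression for item in iterable if condition]

# For loop version
even_squares_loop = []
for x in range(10):
    if x % 2 == 0:                       # condition
        even_squares_loop.append(x * x)  # expression

# Comprehension version: the same three parts, in a different order
even_squares = [x * x for x in range(10) if x % 2 == 0]

print(even_squares)                       # [0, 4, 16, 36, 64]
print(even_squares == even_squares_loop)  # True
```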

Good programmers can do amazing things with comprehensions. With list comprehensions, it becomes really easy, for example, to create nested lists (lists that themselves consist of lists or tuples). Can you figure out what is happening in the following code block?

In [None]:
nested_list = [[x,x+2] for x in range(10, 22, 3)]
print(nested_list)
print(type(nested_list))
print(type(nested_list[3]))

In the first line above, we create a new list (`nested_list`), which we fill not with single numbers but with mini-lists that contain two values. We could just as easily have done this with mini-tuples, by using round brackets. Can you spot the differences below?

In [None]:
nested_tuple = [(x,x+2) for x in range(10, 22, 3)]
print(nested_tuple)
print(type(nested_tuple))
print(type(nested_tuple[3]))

Note that `zip()` can also be very useful in this respect, because you can unpack items inside the comprehension. Do you understand what is going on in the following code block?

In [None]:
a = [2, 3, 5, 7, 0, 2, 8]
b = [3, 2, 1, 7, 0, 0, 9]
diffs = [a-b for a, b in zip(a, b)]
print(diffs)

Again, more complex comprehensions are possible:

In [None]:
diffs = [abs(a-b) for a,b in zip(a, b) if (a & b)] # abs() returns the absolute value; & is the bitwise AND, so pairs without any bits in common (e.g. those containing a 0) are skipped
print(diffs)

Lots of things going on on that one line - you are starting to become a real pro at comprehensions!
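
A word of caution about the condition in that last comprehension: `&` is Python's bitwise AND operator, which is not the same thing as the logical `and`. The sketch below compares the two on the same numbers; the names `first`, `second`, `bitwise` and `logical` are only illustrative:

```python
first = [2, 3, 5, 7, 0, 2, 8]
second = [3, 2, 1, 7, 0, 0, 9]

# Bitwise &: keeps a pair only if the two numbers share at least one set bit
bitwise = [abs(x - y) for x, y in zip(first, second) if x & y]

# Logical and: keeps a pair only if both numbers are non-zero
logical = [abs(x - y) for x, y in zip(first, second) if x and y]

print(bitwise)  # [1, 1, 4, 0, 1]
print(logical)  # [1, 1, 4, 0, 1]

# The two tests are not interchangeable in general: 2 and 1 are both
# non-zero, but they have no bits in common, so 2 & 1 is 0
print(2 & 1, bool(2 and 1))  # 0 True
```

For these particular lists both conditions happen to select the same pairs, but if you simply want to skip pairs that contain a zero, `and` is probably the clearer choice.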

Finally, we should mention that dictionaries can also be filled in a one-liner using such comprehensions. Since dictionaries consist of key-value pairs, the syntax is slightly more complicated. Here, you have to make sure that you link the correct key to the correct value using a colon, in the very first part of the comprehension. The following example will make this clearer:

In [None]:
counts = {word:len(word) for word in words}
print(counts)
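
Two details are worth a quick illustration: keys in a dictionary are unique, so repeated items collapse into a single entry, and the optional condition works just as it does in list comprehensions. The sentence and the variable names below are made up for this sketch:

```python
sample = "the cat saw the other cat".split()

# Keys are unique: repeated words collapse into a single entry
lengths = {word: len(word) for word in sample}
print(len(sample), len(lengths))  # 6 4

# As with lists, an optional condition can filter the items
long_words = {word: len(word) for word in sample if len(word) > 3}
print(long_words)  # {'other': 5}
```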

#### DIY 3
If you want to try your hand at making some list comprehensions, here you go! Try rewriting the for loops below as comprehensions! The check at the end will print `True` if you did it correctly!

In [None]:
words = ["This", "is", "a", "short", "list", "of", "WORDS", "!"]

## For loop
lowercased_words1 = []
for word in words:
    lowercased_words1.append(word.lower())

## Comprehension
lowercased_words2 = 

## Check
print(lowercased_words2)
print(lowercased_words1 == lowercased_words2)

In [None]:
alphabet = "abcdefghijklmnopqrstuvwxyz"

## For loop
consonants1 = []
for letter in alphabet:
    if letter not in "aeiouy":
        consonants1.append(letter)

## Comprehension
consonants2 = 

## Check
print(consonants2)
print(consonants1 == consonants2)

In [None]:
## For loop
laughter1 = {}
for i in range(1, 10):
    laughter1[i] = "ha"*i

## Comprehension
laughter2 = 

## Check
print(laughter2)
print(laughter1 == laughter2)

---
## Web crawling and parsing

One very useful application of scripts is for processing online data. We won't describe in detail how to do that, but will give a few examples using the `requests` library for fetching pages, and the `bs4` (BeautifulSoup) library for parsing HTML.

Let's take these libraries for a spin! Suppose we want to know what LT3 researchers like to write about, by collecting a corpus of paper abstracts from the [LT3 publications page](https://www.lt3.ugent.be/publications/). First, we want to crawl links to individual paper pages, where the abstracts can be found.

In [None]:
import requests
import bs4

r = requests.get('https://www.lt3.ugent.be/publications') # This will store a "Response" object in r
html = r.text # Response objects have an attribute "text" which contains the HTML of the page
soup = bs4.BeautifulSoup(html, "lxml") # This returns a parsed HTML object
links = []
for link in soup.find_all("a", href=True): # Try help(soup.find_all) to understand what this does.
                                           # It takes a lot of work out of your hands!
                                           # href=True skips <a> tags that have no href attribute
    print(link["href"])
    links.append(link["href"]) # Let's store all the links found on this page for later

We now have a list of all the links that we found on `https://www.lt3.ugent.be/publications`. Not all of them are links to individual paper pages, and they are also relative links. Let's fix that!

For the first problem, we can use the fact that all individual pages have a link of the form `/publications/slug-of-publication-name/`. In other words, they all start with `/publications/` and end with a trailing slash. Note that the link `/publications/` also satisfies these conditions, so we need to make sure that there are exactly 3 slashes. Of course, you could come up with other rules for getting the right links; this is just one solution.

This is a perfect time for using list comprehensions! We have a list of links, and we want to improve it.

In [None]:
links = [l for l in links if l.startswith("/publications/")]
links = [l for l in links if l.endswith("/")]
links = [l for l in links if l.count("/") == 3]
print(links)
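
The three filtering steps can also be combined into a single comprehension by chaining the conditions with `and`. The sketch below runs on a handful of made-up links (`sample_links` is hypothetical and only stands in for the crawled list), so that you can check the rules without going online:

```python
# Hypothetical sample links, standing in for the ones crawled above
sample_links = [
    "/publications/",
    "/publications/some-paper-title/",
    "/publications/some-paper-title/edit",
    "/people/",
    "https://www.ugent.be/",
]

# All three conditions can be chained with `and` in a single comprehension
paper_links = [
    link for link in sample_links
    if link.startswith("/publications/")
    and link.endswith("/")
    and link.count("/") == 3
]
print(paper_links)  # ['/publications/some-paper-title/']
```

Whether one long comprehension or three short ones is more readable is largely a matter of taste; the result is the same.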

In [None]:
# Now, let's prefix all the links with the domain name, so we get absolute links instead of relative ones
links = ["https://www.lt3.ugent.be" + l for l in links] # Simple string concatenation!
print(links)
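
String concatenation works fine here because every remaining link starts with a slash. As a possible alternative, the standard library offers `urllib.parse.urljoin()`, which also copes with links that are already absolute. The relative link in this sketch is a made-up example:

```python
from urllib.parse import urljoin

base = "https://www.lt3.ugent.be/publications/"
# Hypothetical relative link of the kind collected above
relative = "/publications/some-paper-title/"

# A link starting with "/" is resolved against the root of the domain
absolute = urljoin(base, relative)
print(absolute)   # https://www.lt3.ugent.be/publications/some-paper-title/

# A link that is already absolute is returned unchanged
unchanged = urljoin(base, "https://www.ugent.be/")
print(unchanged)  # https://www.ugent.be/
```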

Time to get our abstracts! By looping over all these links, downloading each page and extracting what we want from it, we can fill a list with abstracts:

In [None]:
abstracts = []
for link in links:
    print(link)
    r = requests.get(link)
    soup = bs4.BeautifulSoup(r.text, "lxml")
    abstract_div = soup.find("div", {"class": "textfield"})
    print(abstract_div)
    if abstract_div: # Sometimes this is None, when a publication page has no abstract
        # We are only interested in the text of the abstract, not the HTML code around it (div and p tags)
        # BeautifulSoup objects make it very easy to access child elements of an HTML object, using dot notation
        abstract = abstract_div.p.text # because every abstract_div contains a p containing text
        abstracts.append(abstract)
print("Finished crawling", len(abstracts), "abstracts!")

Hooray, we collected a bunch of abstracts! Let's have some fun with them!

In [None]:
## Make a frequency dictionary
freq_dict = {}
for abstract in abstracts:
    words = abstract.lower().split()
    for word in words:
        freq_dict[word] = freq_dict.get(word, 0) + 1

## Find the most frequent words
freq_tuples = list((freq, word) for word, freq in freq_dict.items()) # We need to have the frequency first
freq_tuples.sort() # Because that is what we want to sort on
freq_tuples.reverse() # In descending order
print("The top 30 most frequently used words are:")
for freq, word in freq_tuples[:30]:
    print(freq, word)

# Get the tuples for which the word is longer than 5 characters (to get fewer function words)
freq_tuples = [tup for tup in freq_tuples if len(tup[1]) > 5]
print("The top 30 most frequently used words longer than 5 characters are:")
for freq, word in freq_tuples[:30]:
    print(freq, word)
    
abstract_word_lengths = [len(a.split()) for a in abstracts]
print("Average abstract length in words:", sum(abstract_word_lengths)/len(abstract_word_lengths))
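
The frequency dictionary above is built by hand, which is good practice. For comparison, here is a sketch of the same idea using `Counter` from the standard `collections` module. The two `sample_abstracts` are invented stand-ins for the crawled abstracts, and words with equal counts may come out in a different order than with the sort-and-reverse approach above:

```python
from collections import Counter

# Hypothetical stand-in for the crawled abstracts
sample_abstracts = [
    "We present a corpus of Dutch tweets",
    "A corpus study of Dutch compounds",
]

counts = Counter()
for abstract in sample_abstracts:
    counts.update(abstract.lower().split())  # add one count per word

# most_common(n) returns the n highest (word, count) pairs, highest first
for word, freq in counts.most_common(3):
    print(freq, word)
```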


------------------------------

You've reached the end of Chapter 7! You can safely ignore the code below, it's only there to make the page pretty:

In [None]:
from IPython.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()