## 1. Introduction to iterators
In this course, the second of the Python Data Science toolbox courses, you'll learn about iterables, iterators, list comprehensions and generators - all essential components in the Pythonista's Data Science toolbox. 
### 1.1. Theory.
We will conclude with an entire chapter devoted to a case study in which you'll repeatedly apply techniques learned in both of these courses. Let's talk about iterators.
####  Iterating with a for loop
There's no reason to be scared of iterators because you have actually been working with them for some time now! 
- When you use a for loop to print out each element of a list, you're iterating over the list.

In [1]:
for letter in 'Github':
    print(letter)

G
i
t
h
u
b


- You can also use a for loop to iterate over characters in a string such as you see here. 
- You can also use a for loop to iterate over a sequence of numbers produced by a special range object. 

In [2]:
for k in range(3):
    print(k)

0
1
2


#### Iterators vs. iterables
The reason that we can loop over such objects is that they are special objects called **iterables**: `lists`, `strings` and `range` objects are all iterables, as are many other `Python objects`, such as dictionaries and file connections! 

The actual definition of an iterable is an object that has an associated `__iter__()` method. Once the built-in `iter()` function is applied to an iterable, an iterator object is created. 

Under the hood, this is actually what a for loop is doing: it takes an iterable, creates the associated iterator object, and iterates over it! 
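
As a rough sketch of the description above, the loop can be written out by hand with `iter()` and `next()`. The variable names (`iterator`, `collected`) are illustrative only, and this is an approximation of what the interpreter does, not its actual implementation.

```python
# Roughly what `for letter in word: ...` does behind the scenes.
word = 'Git'
collected = []

iterator = iter(word)            # 1. ask the iterable for an iterator
while True:
    try:
        letter = next(iterator)  # 2. fetch the next value
    except StopIteration:        # 3. no values left: leave the loop
        break
    collected.append(letter)     # the loop body

print(collected)  # ['G', 'i', 't']
```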

#### Iterating over iterables: `next()`
An iterator is defined as an object that has an associated `__next__()` method, called through the built-in `next()` function, that produces the consecutive values. 

In [3]:
word = 'Hi'
it = iter(word)
next(it)

'H'

- To create an iterator from an iterable, all we need to do is use the function iter and pass it the iterable. 

In [4]:
next(it)

'i'

- Once we have the iterator defined, we pass it to the function next and this returns the first value. 

In [5]:
next(it)

StopIteration: 

- Calling `next()` again on the iterator returns the next value until there are no values left to return, at which point it raises a **StopIteration** exception.
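
If hitting the end of an iterator is expected rather than an error, there are two common ways to deal with it. The sketch below shows both: catching the exception, and passing a default value as the second argument to `next()`. The names `exhausted` and `fallback` are made up for the illustration.

```python
word = 'Hi'
it = iter(word)

first = next(it)    # 'H'
second = next(it)   # 'i'

# Option 1: catch the exception explicitly.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# Option 2: give next() a default to return instead of raising.
fallback = next(it, None)

print(first, second, exhausted, fallback)  # H i True None
```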

#### Iterating at once with `*`
You can also print all values of an iterator in one fell swoop using the star operator, referred to as the splat operator in some circles. 

In [6]:
word = 'Gith'
it = iter(word)
print(*it)

G i t h


The **star** operator, `*`, unpacks all elements of an iterator or an iterable. 

In [7]:
print(*it)




Be warned, however: once you do so, you cannot do it again, as there are **no more values** to iterate through! We would have to redefine the iterator with `iter(word)`, as in the preceding cell, to do so.
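
A small sketch of this exhaustion behaviour, using `list()` instead of `print(*it)` so the results can be inspected; the variable names are illustrative.

```python
word = 'Gith'
it = iter(word)

first_pass = list(it)    # consumes every value
second_pass = list(it)   # the iterator is now exhausted

# The string itself is untouched, so a fresh iterator starts over.
it = iter(word)
third_pass = list(it)

print(first_pass)   # ['G', 'i', 't', 'h']
print(second_pass)  # []
print(third_pass)   # ['G', 'i', 't', 'h']
```

Note that only the iterator runs dry; the underlying iterable (`word`) can hand out as many new iterators as you like.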

#### Iterating over dictionaries
We mentioned before that dictionaries and file connections are iterables as well. 

To iterate over the key-value pairs of a Python dictionary, we need to unpack them by applying the items method to the dictionary as you can see here.

In [8]:
giv_dict = {'name': 'Nhan', 
            'd.o.b': '06-Oct-1991', 
            'job': 'Data Scientist'}
for key, value in giv_dict.items():
    print(key + "\t: " + value)

name	: Nhan
d.o.b	: 06-Oct-1991
job	: Data Scientist
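
For comparison, here is a short sketch of the three views a dictionary offers. Looping over the dictionary itself yields only the keys, which is why `.items()` is needed for key-value pairs. The shortened dictionary is just for illustration.

```python
giv_dict = {'name': 'Nhan', 'job': 'Data Scientist'}

keys = list(giv_dict)             # iterating a dict directly yields its keys
values = list(giv_dict.values())  # only the values
pairs = list(giv_dict.items())    # (key, value) tuples

print(keys)    # ['name', 'job']
print(values)  # ['Nhan', 'Data Scientist']
print(pairs)   # [('name', 'Nhan'), ('job', 'Data Scientist')]
```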


#### Iterating over file connections
With respect to file connections, here you can see how to use the `iter()` and `next()` functions to return the lines from a file (in this notebook, `WB.txt`).

In [9]:
file = open(r'../input/lslslslsl/WB.txt')
it = iter(file)
print(next(it))

CountryName,CountryCode,Year,Total Population,Urban population (% of total)



This has been your crash course in the fundamentals of iterables and iterators.

In [10]:
print(next(it))

Arab World,ARB,1960,92495902.0,31.285384211605397
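
The same idea in a self-contained sketch: it writes a tiny stand-in file (the contents are made up, not the real `WB.txt`) and reads it back inside a `with` block, so the file is closed automatically. A file object is its own iterator, so `next()` can be called on it directly.

```python
import os
import tempfile

# Write a tiny stand-in file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('CountryName,Year\nArab World,1960\nArab World,1961\n')
    path = tmp.name

# Each next() call (or loop step) returns one line, newline included.
with open(path) as file:
    header = next(file)
    rows = [line.rstrip('\n') for line in file]  # the remaining lines

os.remove(path)  # clean up the stand-in file

print(repr(header))  # 'CountryName,Year\n'
print(rows)          # ['Arab World,1960', 'Arab World,1961']
```

Each line keeps its trailing newline, which is why `print(next(it))` in the cells above shows an extra blank line after the text.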



### 1.2. PRACTICE
#### Exercise 1.2.1. Iterators vs Iterables
Let's do a quick recall of what you've learned about `iterables` and `iterators`. 

Recall from the video that an iterable is an object that can return an iterator, while an iterator is an object that keeps state and produces the next value when you call `next()` on it. In this exercise, you will identify which object is an iterable and which is an `iterator`.

Given that 

            flash1 = list(flash2)
and            

In [11]:
print(flash1)

['jay garrick', 'barry allen', 'wally west', 'bart allen']


Try printing out their values with `print()` and `next()` to figure out which is an iterable and which is an iterator.
#### Answers
>- `flash1`: `iterable`
>- `flash2`: `iterator`
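
Besides trying `print()` and `next()`, one way to check this programmatically is with the abstract base classes in `collections.abc`. In this sketch, `flash2` is assumed to be an iterator built with `iter()`; the exercise does not say how it was actually created.

```python
from collections.abc import Iterable, Iterator

# Assumption: flash2 is an iterator over the four names.
flash2 = iter(['jay garrick', 'barry allen', 'wally west', 'bart allen'])
flash1 = list(flash2)  # consumes flash2, as in the exercise setup

checks = {
    'flash1': (isinstance(flash1, Iterable), isinstance(flash1, Iterator)),
    'flash2': (isinstance(flash2, Iterable), isinstance(flash2, Iterator)),
}
print(checks)  # {'flash1': (True, False), 'flash2': (True, True)}
```

Every iterator is also an iterable (calling `iter()` on it returns the iterator itself), but a list is only an iterable.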

#### Exercise 1.2.2. Iterating over iterables
In this exercise, you will reinforce your knowledge about these by iterating over and printing from iterables and iterators.

You are provided with a list of strings flash. You will practice iterating over the list by using a for loop. You will also create an iterator for the list and access the values from the iterator.
#### SOLUTION.

In [12]:
# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for item in flash:
    print(item)

# Create an iterator for flash: superhero
superhero = iter(flash)

# Print each item from the iterator
print(next(superhero))
print(next(superhero))
print(next(superhero))
print(next(superhero))

jay garrick
barry allen
wally west
bart allen
jay garrick
barry allen
wally west
bart allen


#### Exercise 1.2.3. Iterating over iterables
One of the things you learned about in this chapter is that not all iterables are actual lists. A couple of examples that we looked at are strings and the use of the `range()` function. In this exercise, we will focus on the `range()` function.

You can use `range()` in a for loop as if it's a list to be iterated over:

            for i in range(5):
                print(i)
Recall that `range()` doesn't actually create the list; instead, it creates a range object with an iterator that produces the values until it reaches the limit (in the example, until the value 4). If `range()` created the actual list, calling it with a value of $10^{100}$ may not work, especially since a number as big as that may go over a regular computer's memory. The value $10^{100}$ is actually what's called a **Googol**, which is a 1 followed by a hundred `0`s. That's a huge number!

Your task for this exercise is to show that calling `range()` with $10^{100}$ won't actually pre-create the list.

#### SOLUTION


In [13]:
# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))

# Loop over range(3) and print the values
for num in range(3):
    print(num)

# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))

0
1
2
0
1
2
0
1
2
3
4
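
Another way to see that nothing is pre-created is to look at how little memory the range object itself takes. This is a sketch for CPython; the exact byte count reported by `sys.getsizeof` depends on the interpreter and platform, so no specific number is claimed here.

```python
import sys

big = range(10 ** 100)

size_in_bytes = sys.getsizeof(big)       # small, regardless of the limit
last_is_member = (10 ** 100 - 1) in big  # answered by arithmetic, not by looping

it = iter(big)
first_three = [next(it), next(it), next(it)]

print(size_in_bytes < 1024)  # True
print(last_is_member)        # True
print(first_three)           # [0, 1, 2]
```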


#### Exercise 1.2.4. Iterators as function arguments
You've been using the `iter()` function to get an iterator object, as well as the `next()` function to retrieve the values one by one from the iterator object.

There are also functions that take iterators and iterables as arguments. For example, the `list()` and `sum()` functions return a list and the sum of elements, respectively.

In this exercise, you will use these functions by passing an iterable from `range()` and then printing the results of the function calls.
#### SOLUTION.

In [14]:
# Create a range object: values
values = range(10, 21)

# Print the range object
print(values)

# Create a list of integers: values_list
values_list = list(values)

# Print values_list
print(values_list)

# Get the sum of values: values_sum
values_sum = sum(values)

# Print values_sum
print(values_sum)

range(10, 21)
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
165
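
One subtlety worth a quick sketch: the solution above passes the same `range` object to both `list()` and `sum()`, which works because a range is a reusable iterable. If an iterator is passed instead, the first function consumes it and the second sees nothing. The name `values_iter` is illustrative.

```python
values = range(10, 21)       # an iterable: can be traversed repeatedly
values_iter = iter(values)   # an iterator: can be traversed only once

list_from_iter = list(values_iter)  # consumes the iterator
sum_from_iter = sum(values_iter)    # nothing left, so the sum is 0

list_from_range = list(values)
sum_from_range = sum(values)        # the range starts afresh each time

print(sum_from_iter)   # 0
print(sum_from_range)  # 165
```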


## 2. Playing with iterators
### 2.1. Theory.
We're now going to dive a bit deeper into the world of iterables and iterators by checking out some very cool, very useful functions.

The first function, `enumerate`, will allow us to add a counter to any iterable while the second function, `zip`, will allow us to stitch together an arbitrary number of iterables.
#### Using enumerate()
`enumerate` is a function that takes any iterable, such as a list, as an argument and returns a special enumerate object, which consists of pairs containing the elements of the original iterable along with their index within the iterable. 

In [15]:
subjects = ['Calculus', 'Big Data', 'Optimization', 'Statistic']
enum = enumerate(subjects)
print(type(enum))

<class 'enumerate'>


We can use the function `list` to turn this enumerate object into a list of `tuples` and print it to see what it contains.

In [16]:
enum_list = list(enum)
enum_list

[(0, 'Calculus'), (1, 'Big Data'), (2, 'Optimization'), (3, 'Statistic')]

#### Enumerate() and unpack
The enumerate object itself is also an iterable, and we can loop over it while unpacking its elements using the clause `for index, value in enumerate(subjects)`.

In [17]:
for index, value in enumerate(subjects):
    print(index, value)

0 Calculus
1 Big Data
2 Optimization
3 Statistic


It is the default behavior of `enumerate` to begin indexing at 0. However, you can alter this with a second argument, `start`, which you can see here.

In [18]:
for index, value in enumerate(subjects, start = 10):
    print(index, value)

10 Calculus
11 Big Data
12 Optimization
13 Statistic
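
To make the behaviour of `enumerate` and its `start` argument concrete, here is a hypothetical stand-in written as a generator function (generators are covered later in the course). `my_enumerate` is a made-up name and is not how the built-in is actually implemented; it only mimics the observable behaviour.

```python
def my_enumerate(iterable, start=0):
    """Hypothetical stand-in for enumerate(): yield (index, value) pairs."""
    index = start
    for value in iterable:
        yield index, value
        index += 1


subjects = ['Calculus', 'Big Data', 'Optimization', 'Statistic']

pairs = list(my_enumerate(subjects, start=10))
print(pairs)
# [(10, 'Calculus'), (11, 'Big Data'), (12, 'Optimization'), (13, 'Statistic')]
```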


#### Using zip()
Now let's move on to `zip`, which accepts an arbitrary number of iterables and returns an `iterator of tuples`. 
- Here we have two lists, one of the `avengers`, the other of their `names`. 
- Zipping them together creates a zip object which is an iterator of `tuples`.

In [19]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(type(z))

<class 'zip'>


- We can turn this zip object into a list and print the list. The first element is a tuple containing the first elements of each list that was zipped.

In [20]:
z_list = list(z)
print(z_list)

[('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]


- The second element is a tuple containing the second elements of each list that was zipped and so on.

#### zip() and unpack
Alternatively, we could use a for loop to iterate over the zip object and print the tuples.

In [21]:
for z1, z2 in zip(avengers, names):
    print(z1, z2)

hawkeye barton
iron man stark
thor odinson
quicksilver maximoff


#### Print zip with `*`
We could also have used the **splat** operator to print all the elements!

In [22]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(*z)

('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')
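
The examples above zip lists of equal length. As a brief sketch of what happens otherwise: `zip` stops at the shortest iterable, while `itertools.zip_longest` pads the missing values. The shortened `names` list and the `'?'` fill value are made up for the illustration.

```python
from itertools import zip_longest

avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson']  # one name short on purpose

short = list(zip(avengers, names))                          # stops after 3 pairs
padded = list(zip_longest(avengers, names, fillvalue='?'))  # pads the gap

print(len(short))   # 3
print(padded[-1])   # ('quicksilver', '?')
```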


### 2.2. PRACTICES.
#### Exercise 2.2.1. Using enumerate
You're really getting the hang of using iterators, great job!

You've just gained several new ideas on iterators from the last video and one of them is the `enumerate()` function. Recall that `enumerate()` returns an enumerate object that produces a sequence of tuples, and each of the tuples is an index-value pair.

In this exercise, you are given a list of strings mutants and you will practice using `enumerate()` on it by printing out a list of tuples and unpacking the tuples using a for loop.

#### SOLUTION.

In [23]:
# Create a list of strings: mutants
mutants = ['charles xavier', 
            'bobby drake', 
            'kurt wagner', 
            'max eisenhardt', 
            'kitty pryde']

# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))

# Print the list of tuples
print(mutant_list)

# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)

# Change the start index
for index2, value2 in enumerate(mutants, start = 1):
    print(index2, value2)

[(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pryde')]
0 charles xavier
1 bobby drake
2 kurt wagner
3 max eisenhardt
4 kitty pryde
1 charles xavier
2 bobby drake
3 kurt wagner
4 max eisenhardt
5 kitty pryde


#### Exercise 2.2.2. Using zip
Another interesting function that you've learned is `zip()`, which takes any number of iterables and returns a zip object that is an iterator of tuples. If you wanted to print the values of a zip object, you can convert it into a list and then print it. Printing just a zip object will not return the values unless you unpack it first. 

In this exercise, you will explore this for yourself.

In [24]:
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility']

Three lists of strings are pre-loaded: `mutants`, `aliases`, and `powers`. First, you will use `list()` and `zip()` on these lists to generate a list of tuples. 
- Then, you will create a `zip` object using `zip()`.
- Finally, you will unpack this zip object in a for loop to print the values in each tuple. 

Observe the different output generated by printing the list of tuples, then the zip object, and finally, the tuple values in the for loop.
#### SOLUTION

In [25]:
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))

# Print the list of tuples
print(mutant_data)

# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)

# Print the zip object
print(mutant_zip)

# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)

[('charles xavier', 'prof x', 'telepathy'), ('bobby drake', 'iceman', 'thermokinesis'), ('kurt wagner', 'nightcrawler', 'teleportation'), ('max eisenhardt', 'magneto', 'magnetokinesis'), ('kitty pryde', 'shadowcat', 'intangibility')]
<zip object at 0x7fb4a5934d20>
charles xavier prof x telepathy
bobby drake iceman thermokinesis
kurt wagner nightcrawler teleportation
max eisenhardt magneto magnetokinesis
kitty pryde shadowcat intangibility


#### Exercise 2.2.3. Using `*` and `zip` to `'unzip'`
Let's play around with `zip()` a little more. There is no unzip function for doing the **reverse** of what `zip()` does. We can, however, reverse what has been zipped together by using `zip()` with a little help from `*`! 

`*` unpacks an iterable such as a list or a tuple into positional arguments in a function call.

In this exercise, you will use `*` in a call to `zip()` to unpack the tuples produced by `zip()`.

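Two tuples of strings, `mutants` and `powers`, are pre-loaded in the original exercise. In this notebook they are the lists defined in the earlier cells, which is why the equality checks in the output print `False`: `zip(*z1)` returns tuples, and a tuple never compares equal to a list.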
#### SOLUTION.

In [26]:
# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# Print the tuples in z1 by unpacking with *
print(*z1)

# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)

# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)

# Check if unpacked tuples are equivalent to original tuples
print(result1 == mutants)
print(result2 == powers)

('charles xavier', 'telepathy') ('bobby drake', 'thermokinesis') ('kurt wagner', 'teleportation') ('max eisenhardt', 'magnetokinesis') ('kitty pryde', 'intangibility')
False
False
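
A short sketch of why the checks print `False` when `mutants` and `powers` are lists: the unzipped results are tuples, and a tuple is never equal to a list even when the elements match. Converting either side makes the comparison succeed. The shortened lists are for illustration only.

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
powers = ['telepathy', 'thermokinesis', 'teleportation']

result1, result2 = zip(*zip(mutants, powers))

print(type(result1))              # <class 'tuple'>
print(result1 == mutants)         # False: tuple vs. list
print(result1 == tuple(mutants))  # True
print(list(result2) == powers)    # True
```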


## 3. Using iterators to load large files into memory
### 3.1. Theory
#### Loading data in chunks
Now, we're going to check out a particular use case that is pertinent to the world of Data Science: dealing with large amounts of data. Let's say that you are pulling data from a file, database or API and there's so much of it, just so much data, that you can't hold it in memory. 

One solution is to load the data in chunks, perform the desired operation or operations on each chunk, store the result, discard the chunk and then load the next chunk; this sounds like a place where an iterator could be useful! To surmount this challenge, we are going to use the `pandas` function `read_csv`, which provides a wonderful option whereby you can load data in chunks and iterate over them. 

In [27]:
import pandas as pd
pd.read_csv(r'../input/lslslslsl/WB.txt', chunksize = 1000)

<pandas.io.parsers.TextFileReader at 0x7fb4a593b6d0>

All we need to do is specify the chunk size using the argument, yep, you guessed it: `chunksize`. As with much of what we do in Data Science, this is best illustrated by an example.
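
To show why chunked reading is a natural fit for iterators, here is a standard-library sketch of the same idea, using `csv.DictReader` and `itertools.islice` in place of `pandas`. The helper `read_in_chunks` and the in-memory data are hypothetical, and this is not how `read_csv` is implemented; it only illustrates the load-process-discard pattern.

```python
import csv
import io
from itertools import islice


def read_in_chunks(file_obj, chunk_size):
    """Hypothetical helper: yield lists of at most chunk_size CSV rows."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))  # pull the next few rows only
        if not chunk:                             # reader exhausted: stop
            return
        yield chunk


# In-memory stand-in for a large CSV file (made-up data).
data = io.StringIO('x\n1\n2\n3\n4\n5\n')

chunk_sizes = [len(chunk) for chunk in read_in_chunks(data, 2)]
print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time; the last chunk is simply shorter when the row count is not a multiple of the chunk size.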

#### Iterating over data
Let's say that we have a csv with a column called 'x' of numbers and we want to compute the sum of all the numbers in that column. However, the file is **too large to store in memory**. 
- We first import pandas and then initialize an `empty list` `result` to hold the result of each iteration. 

In [28]:
result = []

- We then use the `read_csv` function, utilizing the argument `chunksize`, setting it to the size of the chunks we want to read in. In this example, we use a chunk size of 1,000.

You can play around with it. The object created by the `read_csv` call is an iterable, so we can iterate over it using a for loop, in which each chunk will be a DataFrame. Within the for loop, that is, on each iteration, we compute the sum of the column of interest and we append it to the list result. 

Once this is executed, we can take the sum of the list result and this gives us our total sum of the column of interest. Iterators to the rescue!

In [29]:
for chunk in pd.read_csv(r'../input/lslslslsl/WB.txt', chunksize = 1000):
    result.append(sum(chunk['Total Population']))
print(sum(result))

1921212098802.0


- Also note that we need not have used a list to store each result - we could have initialized total to `zero` before iterating over the file and added each sum during the iteration procedure, as you see here.

Now things get really cool: you're going to use an iterator to load `Twitter` data in chunks and perform a similar computation that you did in the prequel to this course.

In [30]:
total = 0
for chunk in pd.read_csv(r'../input/lslslslsl/WB.txt', chunksize = 1000):
    total += sum(chunk['Total Population'])
print(total)

1921212098802.0
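
The running-total pattern can be tried without the data file. The sketch below feeds `read_csv` a small in-memory CSV (made-up data with a single column `x`) and uses the Series method `.sum()` for each chunk instead of the built-in `sum()`.

```python
import io

import pandas as pd

csv_text = 'x\n1\n2\n3\n4\n5\n'  # made-up data standing in for a large file

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk['x'].sum()  # vectorized column sum for this chunk
    n_chunks += 1

print(total, n_chunks)  # 15 3
```

One difference to be aware of: `Series.sum()` skips missing values by default, whereas the built-in `sum()` would return `nan` if the column contained any, so the two are interchangeable only for complete columns.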


### 3.2. PRACTICES
#### Exercise 3.2.1. Processing large amounts of Twitter data
Sometimes, the data we have to process reaches a size that is too much for a computer's memory to handle. This is a common problem faced by data scientists. A solution to this is to process an entire data source chunk by chunk, instead of in a single go all at once.

In this exercise, you will do just that. You will process a large csv file of Twitter data in the same way that you processed `'tweets.csv'` in Bringing it all together exercises of the prequel course, but this time, working on it in chunks of 10 entries at a time.

In [31]:
tweets_file = r'../input/lslslslsl/tweets.txt'

The `pandas` package has been `imported as pd` and the file `'tweets.csv'` is in your current directory for your use.

#### SOLUTION.

In [32]:
# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv(tweets_file, chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict:
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)

{'en': 97, 'et': 1, 'und': 2}
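
As an alternative to the manual `if`/`else` bookkeeping, the standard library's `collections.Counter` can keep the running tally across chunks. In this sketch the chunks are plain lists of made-up language codes standing in for `chunk['lang']`; it is not the course's solution.

```python
from collections import Counter

# Made-up stand-ins for the 'lang' column of successive chunks.
chunks = [['en', 'en', 'et'], ['und', 'en'], ['en', 'und']]

counts = Counter()
for chunk in chunks:
    counts.update(chunk)  # add this chunk's tallies to the running total

print(dict(counts))  # {'en': 4, 'et': 1, 'und': 2}
```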


#### Exercise 3.2.2. Extracting information for large amounts of Twitter data
Great job chunking out that file in the previous exercise. You now know how to deal with situations where you need to process a very **large file** and that's a very useful skill to have!

It's good to know how to process a file in smaller, more manageable chunks, but it can become very tedious having to write and rewrite the same code for the same task each time. In this exercise, you will be making your code more reusable by putting your work in the last exercise in a function definition.

The `pandas` package has been imported as pd and the file `'tweets.csv'` is in your current directory for your use.
#### SOLUTION.

In [33]:
# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize = c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries(tweets_file, 10, 'lang')

# Print result_counts
print(result_counts)

{'en': 97, 'et': 1, 'und': 2}
