### Generators - where to find them

**Generator** - what do you associate this word with? Perhaps it refers to some electronic device. Or perhaps it refers to a heavy and serious machine designed to produce power, electrical or other.

A Python generator is a piece of specialized code able to produce a series of values, and to control the iteration process. This is why generators are very often called **iterators**, and although some may find a very subtle distinction between these two, we'll treat them as one.

You may not realize it, but you've encountered generators many, many times before. Take a look at the very simple snippet:

In [1]:
for i in range(5):
    print(i)


0
1
2
3
4


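The **range()** function is used here just like a generator: the for loop asks it for subsequent values, one by one, through an iterator. (Strictly speaking, range() creates a lazy sequence object rather than a generator object, but the for loop drives both in the same way.)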

What is the difference?

A function returns one, well-defined value - it may be the result of a more or less complex evaluation of, e.g., a polynomial, and is invoked once - only once.

A generator returns a series of values, and in general, is (implicitly) invoked more than once.

In the example, the iterator obtained from range() is asked for a value six times: the first five requests provide the subsequent values from zero to four, and the sixth one signals that the series is complete.

The above process is completely transparent. Let's shed some light on it. Let's show you the **iterator protocol**.
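Before the formal description, here is a minimal sketch of what the for loop does with range(5) behind the scenes. It relies only on the built-in iter() and next() functions; the variable names (iterator, values, requests) are ours and are used purely for illustration.

```python
# A hand-written equivalent of: for i in range(5): print(i)
iterator = iter(range(5))    # ask the range object for an iterator (done once)
values = []
requests = 0

while True:
    requests += 1
    try:
        value = next(iterator)   # ask for the next value of the series
    except StopIteration:        # raised when the series is complete
        break
    values.append(value)
    print(value)

print("requests:", requests)     # requests: 6
```

Five requests deliver a value, and the sixth one ends the loop.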

   

### Generators - where to find them: continued

The iterator protocol is a way in which an object should behave to conform to the rules imposed by the context of the for and in statements. An object conforming to the iterator protocol is called an iterator.

An iterator must provide two methods:

    __iter__() which should return the object itself and which is invoked once (it's needed for Python to successfully start the iteration)
    __next__() which is intended to return the next value (first, second, and so on) of the desired series - it will be invoked by the for/in statements in order to pass through the next iteration; if there are no more values to provide, the method should raise the StopIteration exception.

Does it sound strange? Not at all. Look at the example below.

In [2]:
class Fib:
    def __init__(self, nn):
        print("__init__")
        self.__n = nn
        self.__i = 0
        self.__p1 = self.__p2 = 1

    def __iter__(self):
        print("__iter__")
        return self

    def __next__(self):
        print("__next__")
        self.__i += 1
        if self.__i > self.__n:
            raise StopIteration
        if self.__i in [1, 2]:
            return 1
        ret = self.__p1 + self.__p2
        self.__p1, self.__p2 = self.__p2, ret
        return ret


for i in Fib(10):
    print(i)

__init__
__iter__
__next__
1
__next__
1
__next__
2
__next__
3
__next__
5
__next__
8
__next__
13
__next__
21
__next__
34
__next__
55
__next__


We've built a class able to iterate through the first n values (where n is a constructor parameter) of the Fibonacci numbers.

Let us remind you - the Fibonacci numbers ($Fib_i$) are defined as follows:

$Fib_1 = 1$

$Fib_2 = 1$

$Fib_i = Fib_{i-1} + Fib_{i-2}$

In other words:

    the first two Fibonacci numbers are equal to 1;
    any other Fibonacci number is the sum of the two previous ones (e.g., $Fib_3 = 2$, $Fib_4 = 3$, $Fib_5 = 5$, and so on)

Let's dive into the code:

    lines 2 through 6: the class constructor prints a message (we'll use this to trace the class's behavior), prepares some variables (__n to store the series limit, __i to track the current Fibonacci number to provide, and __p1 along with __p2 to save the two previous numbers);

    lines 8 through 10: the __iter__ method is obliged to return the iterator object itself; its purpose may be a bit ambiguous here, but there's no mystery; try to imagine an object which is not an iterator (e.g., it's a collection of some entities), but one of its components is an iterator able to scan the collection; the __iter__ method should extract the iterator and entrust it with the execution of the iteration protocol; as you can see, the method starts its action by printing a message;

    lines 12 through 21: the __next__ method is responsible for creating the sequence; it's somewhat wordy, but this should make it more readable; first, it prints a message, then it increments the counter of values delivered so far, and if the counter exceeds the requested number of values, the method breaks the iteration by raising the StopIteration exception; the rest of the code is simple, and it precisely reflects the definition we showed you earlier;

    lines 24 and 25 make use of the iterator.


    the iterator object is instantiated first;
    next, Python invokes the __iter__ method to get access to the actual iterator;
    the __next__ method is invoked eleven times - the first ten times produce useful values, while the eleventh terminates the iteration.
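The three steps above can be checked without reading printed traces. The sketch below uses a hypothetical UpTo class (a shortened stand-in for Fib, not part of the original example) which records every protocol call in a list.

```python
class UpTo:
    """Hypothetical iterator delivering 1..n and recording each protocol call."""

    def __init__(self, n):
        self.n = n
        self.i = 0
        self.calls = ["__init__"]

    def __iter__(self):
        self.calls.append("__iter__")
        return self

    def __next__(self):
        self.calls.append("__next__")
        self.i += 1
        if self.i > self.n:
            raise StopIteration
        return self.i


source = UpTo(3)
for value in source:
    print(value)

# One __init__, one __iter__, and n + 1 invocations of __next__.
print(source.calls)
```

For three values, __next__ is invoked four times, in the same way as Fib(10) needs eleven invocations.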


#### Generators - where to find them: continued

The following example shows you a solution where the iterator object is a part of a more complex class.

The code isn't really sophisticated, but it presents the concept in a clear way.

Take a look at the code below.

In [27]:
class Fib:
    def __init__(self, nn):
        self.__n = nn
        self.__i = 0
        self.__p1 = self.__p2 = 1

    def __iter__(self):
        print("Fib iter")
        return self

    def __next__(self):
        self.__i += 1
        if self.__i > self.__n:
            raise StopIteration
        if self.__i in [1, 2]:
            return 1
        ret = self.__p1 + self.__p2
        self.__p1, self.__p2 = self.__p2, ret
        return ret

class Class:
    def __init__(self, n):
        self.__iter = Fib(n)

    def __iter__(self):
        print("Class iter")
        return self.__iter


obj = Class(10)

for i in obj:
    print(i)


Class iter
1
1
2
3
5
8
13
21
34
55


We've built the Fib iterator into another class (we can say that we've composed it into the Class class). It's instantiated along with Class's object.

The object of the class may be used as an iterator when (and only when) it positively answers to the __iter__ invocation - this class can do it, and if it's invoked in this way, it provides an object able to obey the iteration protocol.

This is why the output of the code is the same as previously, although the object of the Fib class isn't used explicitly inside the for loop's context.
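To see who does what in such a composition, here is a small sketch with two hypothetical classes, Numbers (a stand-in for Fib) and Container (a stand-in for Class). It assumes nothing beyond the built-in iter(), hasattr() and list() functions.

```python
class Numbers:
    """Hypothetical iterator over 1..n (plays the role of Fib)."""

    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.i += 1
        if self.i > self.n:
            raise StopIteration
        return self.i


class Container:
    """Hypothetical class that hands out its inner iterator on request."""

    def __init__(self, n):
        self.inner = Numbers(n)

    def __iter__(self):
        return self.inner


box = Container(3)
it = iter(box)                     # this is what the for loop does first
print(type(it).__name__)           # Numbers
print(hasattr(box, "__next__"))    # False - the container is not an iterator
first_run = list(it)
second_run = list(box)             # the same inner iterator, already exhausted
print(first_run, second_run)       # [1, 2, 3] []
```

Note the design consequence: because the container always returns the same inner object, it can be iterated through only once.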


### The yield statement

The iterator protocol isn't particularly difficult to understand and use, but it is also indisputable that the protocol is rather inconvenient.

The main discomfort it brings is the need to save the state of the iteration between subsequent __next__ invocations.

For example, the **Fib** iterator is forced to precisely store the place in which the last invocation has been stopped (i.e., the evaluated number and the values of the two previous elements). This makes the code larger and less comprehensible.

This is why Python offers a much more effective, convenient, and elegant way of writing iterators.

The concept is fundamentally based on a very specific and powerful mechanism provided by the **yield** keyword.

You may think of the **yield** keyword as a smarter sibling of the **return** statement, with one essential difference.

Take a look at this function:

In [5]:
def fun(n):
    for i in range(n):
        return i


It looks strange, doesn't it? It's clear that the for loop has no chance to finish its first execution, as the return will break it irrevocably.

Moreover, invoking the function won't change anything - the for loop will start from scratch and will be broken immediately.

We can say that such a function is not able to save and restore its state between subsequent invocations.

This also means that a function like this cannot be used as a generator.




We've replaced exactly one word in the code - can you see it?

In [8]:
def fun(n):
    for i in range(n):
        yield i


We've added yield instead of return. This little amendment turns the function into a generator, and executing the yield statement has some very interesting effects.

First of all, it provides the value of the expression specified after the yield keyword, just like return, but doesn't lose the state of the function.

All the variables' values are frozen, and wait for the next invocation, when the execution is resumed (not taken from scratch, like after return).

There is one important limitation: such a function should not be invoked like an ordinary function, because - in fact - it isn't an ordinary function anymore; it's a generator function.

The invocation will return a generator object, not the series we expect from the generator; the values appear only when that object is iterated.

Due to the same reasons, the previous function (the one with the return statement) may only be invoked explicitly, and must not be used as a generator.
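A short sketch may make the frozen state more tangible. It calls the generator function once and then pulls values by hand with the built-in next() function; the names gen, first, second and rest are ours.

```python
def fun(n):
    for i in range(n):
        yield i


gen = fun(3)                  # nothing is executed yet
print(type(gen).__name__)     # generator

first = next(gen)             # runs up to the first yield
second = next(gen)            # resumes right after it, i is remembered
rest = list(gen)              # whatever is left in the series
print(first, second, rest)    # 0 1 [2]
```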

#### How to build a generator

Let us show you the new generator in action.

This is how we can use it:

In [9]:
def fun(n):
    for i in range(n):
        yield i


for v in fun(5):
    print(v)


0
1
2
3
4


In [18]:
def why(n):
    for i in range(n):
        yield i

for v in why(5):
    print(v)
    

0
1
2
3
4


#### How to build your own generator

What if you need a generator to produce the first n powers of 2?

Nothing easier. Just look at the code below:

In [22]:
def powers_of_2(n):
    power = 1
    for i in range(n):
        yield power
        power *= 2


for v in powers_of_2(8):
    print(v)

1
2
4
8
16
32
64
128


#### List comprehensions

Generators may also be used within list comprehensions, just like here:

In [23]:
def powers_of_2(n):
    power = 1
    for i in range(n):
        yield power
        power *= 2


t = [x for x in powers_of_2(5)]
print(t)



[1, 2, 4, 8, 16]



#### The list() function

The list() function can transform a series of subsequent generator invocations into a real list:

In [24]:
def powers_of_2(n):
    power = 1
    for i in range(n):
        yield power
        power *= 2


t = list(powers_of_2(3))
print(t)



[1, 2, 4]


### The in operator

Moreover, the context created by the in operator allows you to use a generator, too.

The example shows how to do it:

In [25]:
def powers_of_2(n):
    power = 1
    for i in range(n):
        yield power
        power *= 2


for i in range(20):
    if i in powers_of_2(4):
        print(i)

1
2
4
8
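One detail is worth keeping in mind: the in operator has to pull values out of the generator to do its job. The sketch below (the variable names are ours) shows that a generator object tested this way is partially consumed; the loop above avoids the issue by creating a fresh generator for each test.

```python
def powers_of_2(n):
    power = 1
    for i in range(n):
        yield power
        power *= 2


gen = powers_of_2(4)             # will deliver 1, 2, 4, 8
found = 2 in gen                 # pulls 1 and 2, then stops searching
remaining = list(gen)            # only the values not pulled yet
print(found, remaining)          # True [4, 8]

missing = 3 in powers_of_2(4)    # pulls all four values, finds nothing
print(missing)                   # False
```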


#### The Fibonacci number generator

Now let's see a Fibonacci number generator, and ensure that it looks much better than the object-oriented version based on the direct iterator protocol implementation.

Here it is:

In [26]:
def fibonacci(n):
    p = pp = 1
    for i in range(n):
        if i in [0, 1]:
            yield 1
        else:
            next_value = p + pp
            pp, p = p, next_value
            yield next_value

fibs = list(fibonacci(10))
print(fibs)

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]


#### More about list comprehensions

You should be able to remember the rules governing the creation and use of a very special Python phenomenon named list comprehension - a simple and very impressive way of creating lists and their contents.

In case you need it, we've provided a quick reminder below.

In [28]:
list_1 = []

for ex in range(6):
    list_1.append(10 ** ex)

list_2 = [10 ** ex for ex in range(6)]

print(list_1)
print(list_2)

[1, 10, 100, 1000, 10000, 100000]
[1, 10, 100, 1000, 10000, 100000]


There are two parts inside the code, both creating a list containing a few of the first natural powers of ten.

The former uses a routine way of utilizing the for loop, while the latter makes use of the list comprehension and builds the list in situ, without needing a loop, or any other extended code.

It looks like the list is created inside itself - it's not true, of course, as Python has to perform nearly the same operations as in the first snippet, but it is indisputable that the second formalism is simply more elegant, and lets the reader avoid any unnecessary details.

The example outputs two identical lines, each containing the same list of powers of ten.

#### More about list comprehensions: continued

There is a very interesting syntax we want to show you now. Its usability is not limited to list comprehensions, but we have to admit that comprehensions are the ideal environment for it.

It's a conditional expression - a way of selecting one of two different values based on the result of a Boolean expression.

Look:

expression_one if condition else expression_two

It may look a bit surprising at first glance, but you have to keep in mind that it is not a conditional instruction. Moreover, it's not an instruction at all. It's an operator.

The value it provides is equal to expression_one when the condition is True, and expression_two otherwise.

A good example will tell you more. Look at the code below.

In [29]:
the_list = []

for x in range(10):
    the_list.append(1 if x % 2 == 0 else 0)

print(the_list)


[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]


The code fills a list with 1's and 0's - if the index of a particular element is odd, the element is set to 0, and to 1 otherwise.

Simple? Maybe not at first glance. Elegant? Indisputably.

Can you use the same trick within a list comprehension? Yes, you can.

#### More about list comprehensions: continued

Look at the example below.

In [31]:
the_list = [1 if x % 2 == 0 else 0 for x in range(10)]

print(the_list)


[1, 0, 1, 0, 1, 0, 1, 0, 1, 0]


Compactness and elegance - these two words come to mind when looking at the code.

So, what do they have in common, generators and list comprehensions? Is there any connection between them? Yes. A rather loose connection, but an unequivocal one.

Just one change can turn any list comprehension into a generator.

#### List comprehensions vs. generators

Now look at the code below and see if you can find the detail that turns a list comprehension into a generator:

In [32]:
the_list = [1 if x % 2 == 0 else 0 for x in range(10)]
the_generator = (1 if x % 2 == 0 else 0 for x in range(10))

for v in the_list:
    print(v, end=" ")
print()

for v in the_generator:
    print(v, end=" ")
print()

1 0 1 0 1 0 1 0 1 0 
1 0 1 0 1 0 1 0 1 0 


It's the parentheses. The brackets make a comprehension, the parentheses make a generator.

The code, however, when run, produces two identical lines of output.

How can you know that the second assignment creates a generator, not a list?

There is some proof we can show you. Apply the len() function to both these entities.

len(the_list) will evaluate to 10. Clear and predictable. len(the_generator) will raise an exception, and you will see the following message:

In [33]:
len(the_list)

10

In [34]:
len(the_generator)

TypeError: object of type 'generator' has no len()

Of course, saving either the list or the generator is not necessary - you can create them exactly in the place where you need them - just like here:

In [35]:
for v in [1 if x % 2 == 0 else 0 for x in range(10)]:
    print(v, end=" ")
print()

for v in (1 if x % 2 == 0 else 0 for x in range(10)):
    print(v, end=" ")
print()



1 0 1 0 1 0 1 0 1 0 
1 0 1 0 1 0 1 0 1 0 


Note: the same appearance of the output doesn't mean that both loops work in the same way. In the first loop, the list is created (and iterated through) as a whole - it actually exists when the loop is being executed.

In the second loop, there is no list at all - there are only subsequent values produced by the generator, one by one.
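As a starting point for such experiments, here is a small sketch of one practical consequence: a list can be traversed many times, while a generator delivers its values only once. The names first_pass and second_pass are ours.

```python
the_list = [1 if x % 2 == 0 else 0 for x in range(10)]
the_generator = (1 if x % 2 == 0 else 0 for x in range(10))

list_first = sum(the_list)
list_second = sum(the_list)           # the list is still there
first_pass = sum(the_generator)
second_pass = sum(the_generator)      # the generator is already exhausted

print(list_first, list_second)        # 5 5
print(first_pass, second_pass)        # 5 0
```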

Carry out your own experiments.

### The lambda function

The lambda function is a concept borrowed from mathematics, more specifically, from a part called the Lambda calculus, but these two phenomena are not the same.

Mathematicians use the Lambda calculus in many formal systems connected with logic, recursion, or theorem provability. Programmers use the lambda function to simplify the code, to make it clearer and easier to understand.

A lambda function is a function without a name (you can also call it an anonymous function). Of course, such a statement immediately raises the question: how do you use anything that cannot be identified?

Fortunately, it's not a problem, as you can name such a function if you really need, but, in fact, in many cases the lambda function can exist and work while remaining fully incognito.

The declaration of the lambda function doesn't resemble a normal function declaration in any way - see for yourself:

In [None]:
lambda parameters: expression

Such a clause evaluates the expression using the current values of the lambda's arguments and returns the result.

As usual, an example will be helpful. Our example uses three lambda functions, but gives them names. Look at it carefully:

In [1]:
two = lambda: 2
sqr = lambda x: x * x
pwr = lambda x, y: x ** y

for a in range(-2, 3):
    print(sqr(a), end=" ")
    print(pwr(a, two()))

4 4
1 1
0 0
1 1
4 4


Let's analyze it:

    the first lambda is an anonymous parameterless function that always returns 2. As we've assigned it to a variable named two, we can say that the function is not anonymous anymore, and we can use the name to invoke it.

    the second one is a one-parameter anonymous function that returns the value of its squared argument. We've named it as such, too.

    the third lambda takes two parameters and returns the value of the first one raised to the power of the second one. The name of the variable which carries the lambda speaks for itself. We don't use pow to avoid confusion with the built-in function of the same name and the same purpose.


This example is clear enough to show how lambdas are declared and how they behave, but it says nothing about why they're necessary, and what they're used for, since they can all be replaced with routine Python functions.

Where is the benefit?

#### How to use lambdas and what for?

The most interesting part of using lambdas appears when you can use them in their pure form - as anonymous parts of code intended to evaluate a result.

Imagine that we need a function (we'll name it print_function) which prints the values of a given (other) function for a set of selected arguments.

We want print_function to be universal - it should accept a set of arguments put in a list and a function to be evaluated, both as arguments - we don't want to hardcode anything.

Look at the example below. This is how we've implemented the idea.

In [2]:
def print_function(args, fun):
    for x in args:
        print('f(', x,')=', fun(x), sep='')


def poly(x):
    return 2 * x**2 - 4 * x + 2


print_function([x for x in range(-2, 3)], poly)


f(-2)=18
f(-1)=8
f(0)=2
f(1)=0
f(2)=2


Let's analyze it. The print_function() function takes two parameters:

    the first, a list of arguments for which we want to print the results;
    the second, a function which should be invoked as many times as the number of values that are collected inside the first parameter.

Note: we've also defined a function named poly() - this is the function whose values we're going to print. The calculation the function performs isn't very sophisticated - it's the polynomial (hence its name) of a form:

$f(x) = 2x^2 - 4x + 2$

The name of the function is then passed to the print_function() along with a set of five different arguments - the set is built with a list comprehension clause.

Can we avoid defining the poly() function, as we're not going to use it more than once? Yes, we can - this is the benefit a lambda can bring.

Look at the example below. Can you see the difference?

In [3]:
def print_function(args, fun):
    for x in args:
        print('f(', x,')=', fun(x), sep='')

print_function([x for x in range(-2, 3)], lambda x: 2 * x**2 - 4 * x + 2)



f(-2)=18
f(-1)=8
f(0)=2
f(1)=0
f(2)=2


The print_function() has remained exactly the same, but there is no poly() function. We don't need it anymore, as the polynomial is now directly inside the print_function() invocation in the form of a lambda defined in the following way:

In [4]:
lambda x: 2 * x**2 - 4 * x + 2


<function __main__.<lambda>(x)>

The code has become shorter, clearer, and more legible.

Let us show you another place where lambdas can be useful. We'll start with a description of map(), a built-in Python function. Its name isn't too descriptive, its idea is simple, and the function itself is really usable.

### Lambdas and the map() function

In the simplest of all possible cases, the map() function:

In [None]:
map(function, list)

takes two arguments:

    a function;
    a list.

The above description is extremely simplified, as:

    the second map() argument may be any entity that can be iterated (e.g., a tuple, or just a generator)
    map() can accept more than two arguments.

The map() function applies the function passed by its first argument to all its second argument's elements, and returns an iterator delivering all subsequent function results.

You can use the resulting iterator in a loop, or convert it into a list using the list() function.
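Both remarks from the simplified description can be sketched in a few lines: map() fed with more than one iterable (the function then needs as many parameters as there are iterables), and map() fed with a generator. The sample data and variable names are ours.

```python
bases = [1, 2, 3]
exponents = (2, 3, 4)

# Two iterables: the lambda receives one element from each of them.
powers = list(map(lambda b, e: b ** e, bases, exponents))
print(powers)                 # [1, 8, 81]

# A generator expression as the second argument; the result is an iterator.
squares = map(lambda x: x * x, (x for x in range(4)))
first = next(squares)
rest = list(squares)
print(first, rest)            # 0 [1, 4, 9]
```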

Can you see a role for any lambda here?

Look at the code below - we've used two lambdas in it.

In [8]:
list_1 = [x for x in range(5)]
list_2 = list(map(lambda x: 2 ** x, list_1))
print(list_2)

for x in map(lambda x: x * x, list_2):
    print(x, end=' ')
print()

[1, 2, 4, 8, 16]
1 4 16 64 256 


This is the intrigue:

    build the list_1 with values from 0 to 4;
    next, use map along with the first lambda to create a new list in which all elements have been evaluated as 2 raised to the power taken from the corresponding element from list_1;
    list_2 is then printed;
    in the next step, use the map() function again to make use of the iterator it returns and to directly print all the values it delivers; as you can see, we've engaged the second lambda here - it just squares each element from list_2.

Try to imagine the same code without lambdas. Would it be any better? It's unlikely.

### Lambdas and the filter() function

Another Python function which can be significantly beautified by the application of a lambda is filter().

It expects the same kind of arguments as map(), but does something different - it filters its second argument while being guided by directions flowing from the function specified as the first argument (the function is invoked for each list element, just like in map()).

The elements for which the function returns True pass the filter - the others are rejected.

The example below shows the filter() function in action.

In [13]:
from random import seed, randint

seed()
data = [randint(-10,10) for x in range(5)]
filtered = list(filter(lambda x: x > 0 and x % 2 == 0, data))

print(data)
print(filtered)

[5, 2, 9, 7, -2]
[2]


Note: we've made use of the random module to initialize the random number generator (not to be confused with the generators we've just talked about) with the seed() function, and to produce five random integer values from -10 to 10 using the randint() function.

The list is then filtered, and only the numbers which are even and greater than zero are accepted.

Of course, it's not likely that you'll receive the same results, but the output above shows what our results looked like.
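If you prefer repeatable results, the same filter can be tried on fixed data. The sketch below uses a hand-picked list (our assumption, not the random data above) and compares filter() with an equivalent list comprehension.

```python
data = [5, 2, 9, 7, -2, 8, 0]    # hand-picked sample values

filtered = list(filter(lambda x: x > 0 and x % 2 == 0, data))
comprehended = [x for x in data if x > 0 and x % 2 == 0]

print(filtered)                  # [2, 8]
print(filtered == comprehended)  # True
```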

### A brief look at closures

Let's start with a definition: closure is a technique which allows the storing of values in spite of the fact that the context in which they have been created does not exist anymore. Intricate? A bit.

Let's analyze a simple example:

In [15]:
def outer(par):
    loc = par


var = 1
outer(var)

print(par)
print(loc)


NameError: name 'par' is not defined

The example is obviously erroneous.

The last two lines are the problem: neither par nor loc is accessible outside the function, so the first print() raises a NameError exception (and the second one would do the same if it were ever reached). Both the variables exist when and only when the outer() function is being executed.

Look at the example below. We've modified the code significantly.

In [14]:
def outer(par):
    loc = par

    def inner():
        return loc
    return inner


var = 1
fun = outer(var)
print(fun())

1


There is a brand new element in it – a function (named inner) inside another function (named outer).

How does it work? Just like any other function, except that the name inner is visible only inside outer(). We can say that inner() is outer()'s private tool - no other part of the code can reach it by that name, unless outer() hands it out, as it does here with its return statement.

Look carefully:

    the inner() function returns the value of the variable accessible inside its scope, as inner() can use any of the entities at the disposal of outer()
    the outer() function returns the inner() function itself; more precisely, it returns the function object created during this particular invocation of outer(), together with the environment it refers to; this means that the value of loc is successfully retained, although the outer() invocation finished a long time ago.
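The retained environment is not magic and can be inspected. The sketch below peeks at it through the function's __code__.co_freevars and __closure__ attributes; this is only an illustration of where the value lives, not something everyday code needs to do.

```python
def outer(par):
    loc = par

    def inner():
        return loc
    return inner


fun = outer(1)

print(fun.__name__)                        # inner
print(fun.__code__.co_freevars)            # ('loc',)
print(fun.__closure__[0].cell_contents)    # 1 - the retained value of loc
```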


### A brief look at closures: continued

A closure has to be invoked in exactly the same way in which it has been declared.

In the example below:

In [1]:
def outer(par):
    loc = par

    def inner():
        return loc
    return inner


var = 1
fun = outer(var)
print(fun())



1


the inner() function is parameterless, so we have to invoke it without arguments.

Now look at the code at the end of this section. It is fully possible to declare a closure equipped with an arbitrary number of parameters, e.g., one, just like the power() function.

This means that the closure not only makes use of the frozen environment, but it can also modify its behavior by using values taken from the outside.

This example shows one more interesting circumstance - you can create as many closures as you want using one and the same piece of code. This is done with a function named make_closure(). Note:

    the first closure obtained from make_closure() defines a tool squaring its argument;
    the second one is designed to cube the argument.

This is why the code below produces the output shown after it:

In [3]:
def make_closure(par):
    loc = par

    def power(p):
        return p ** loc
    return power


fsqr = make_closure(2)
fcub = make_closure(3)

for i in range(5):
    print(i, fsqr(i), fcub(i))


0 0 0
1 1 1
2 4 8
3 9 27
4 16 64



### Accessing files from Python code

One of the most common tasks in a developer's job is processing data stored in files, which usually reside physically on storage devices - hard, optical, network, or solid-state disks.

It's easy to imagine a program that sorts 20 numbers, and it's equally easy to imagine the user of this program entering these twenty numbers directly from the keyboard.

It's much harder to imagine the same task when there are 20,000 numbers to be sorted, and there isn't a single user who is able to enter these numbers without making a mistake.

It's much easier to imagine that these numbers are stored in the disk file which is read by the program. The program sorts the numbers and doesn't send them to the screen, but instead creates a new file and saves the sorted sequence of numbers there.

If we want to implement a simple database, the only way to store the information between program runs is to save it into a file (or files if your database is more complex).



In principle, any non-simple programming problem relies on the use of files, whether it processes images (stored in files), multiplies matrices (stored in files), or calculates wages and taxes (reading data stored in files).

You may ask why we have waited until now to show you these issues.

The answer is very simple - Python's way of accessing and processing files is implemented using a consistent set of objects. There is no better moment to talk about it.



### File names

Different operating systems can treat the files in different ways. For example, Windows uses a different naming convention than the one adopted in Unix/Linux systems.

If we use the notion of a canonical file name (a name which uniquely defines the location of the file regardless of its level in the directory tree) we can realize that these names look different in Windows and in Unix/Linux:

    for Windows: C:\directory\file
    for Linux:   /directory/file



As you can see, systems derived from Unix/Linux don't use the disk drive letter (e.g., C:) and all the directories grow from one root directory called /, while Windows systems recognize the root directory as \.


In addition, Unix/Linux system file names are case-sensitive. Windows systems store the case of letters used in the file name, but don't distinguish between their cases at all.

This means that these two strings: ThisIsTheNameOfTheFile and thisisthenameofthefile describe two different files in Unix/Linux systems, but are the same name for just one file in Windows systems.

The main and most striking difference is that you have to use two different separators for the directory names: \ in Windows, and / in Unix/Linux.

This difference is not very important to the normal user, but is very important when writing programs in Python.

To understand why, try to recall the very specific role played by the \ inside Python strings. 

#### File names: continued

Suppose you're interested in a particular file located in the directory dir, and named file.

Suppose also that you want to assign a string containing the name of the file to a variable.

In Unix/Linux systems, it may look as follows:

In [3]:
name = "/dir/file"

But if you try to code it for the Windows system:

In [None]:
name = "\dir\file"

you'll get an unpleasant surprise: either Python will generate an error, or the execution of the program will behave strangely, as if the file name has been distorted in some way.

In fact, it's not strange at all, but quite obvious and natural. Python uses the \ as an escape character (like \n).

This means that Windows file names must be written as follows:

In [5]:
name = "\\dir\\file"
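The effect of the escape character is easy to check. The sketch below compares the doubled-backslash form with a raw string (the r prefix is a standard Python feature, although it isn't discussed in this section) and shows what happens to a single \f.

```python
escaped = "\\dir\\file"      # each \\ stands for one backslash
raw = r"\dir\file"           # raw string: backslashes are taken literally

print(escaped)               # \dir\file
print(escaped == raw)        # True
print(len(escaped))          # 9

broken = "\file"             # \f is a single form feed character
print(len(broken))           # 4, not 5
```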


Fortunately, there is also one more solution. Python is smart enough to be able to convert slashes into backslashes each time it discovers that it's required by the OS.

This means that any of the following assignments will work on Windows, too:

In [6]:
name = "/dir/file"
name = "c:/dir/file"
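If you want to see the Windows form of such a name on any system, the standard pathlib module can help. The sketch below uses PureWindowsPath, which only manipulates the text of the path and never touches the disk; using pathlib here is our suggestion and is not part of the original material.

```python
from pathlib import PureWindowsPath

win = PureWindowsPath("c:/dir/file")

print(str(win))      # c:\dir\file
print(win.drive)     # c:
print(win.name)      # file
```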

### Opening the streams

The opening of the stream is performed by a function which can be invoked in the following way:

In [None]:
stream = open(file, mode = 'r', encoding = None)

Let's analyze it:

    the name of the function (open) speaks for itself; if the opening is successful, the function returns a stream object; otherwise, an exception is raised (e.g., FileNotFoundError if the file you're going to read doesn't exist);

    the first parameter of the function (file) specifies the name of the file to be associated with the stream;

    the second parameter (mode) specifies the open mode used for the stream; it's a string filled with a sequence of characters, and each of them has its own special meaning (more details soon);

    the third parameter (encoding) specifies the encoding type (e.g., UTF-8 when working with text files)

    the opening must be the very first operation performed on the stream.

Note: the mode and encoding arguments may be omitted - their default values are assumed then. The default opening mode is reading in text mode, while the default encoding depends on the platform used.
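Here is a minimal, runnable sketch of the invocation described above. To avoid touching your own files, it works in a temporary directory created with the standard tempfile module; the file names example.txt and missing.txt are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")    # hypothetical file name

    # Create the file first, naming the encoding explicitly.
    stream = open(path, mode="w", encoding="utf-8")
    stream.write("café\n")
    stream.close()

    # Open it for reading with the same encoding.
    stream = open(path, mode="r", encoding="utf-8")
    content = stream.read()
    stream.close()
    print(content, end="")

    # Opening a non-existent file for reading raises an exception.
    try:
        open(os.path.join(tmp_dir, "missing.txt"))
    except FileNotFoundError as exc:
        error_name = type(exc).__name__
        print(error_name)                          # FileNotFoundError
```

Naming the encoding explicitly makes the result independent of the platform default.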

Let us now present you with the most important and useful open modes. Ready?

### Opening the streams: modes

r open mode: read

    the stream will be opened in read mode;
    the file associated with the stream must exist and has to be readable, otherwise the open() function raises an exception.

w open mode: write

    the stream will be opened in write mode;
    the file associated with the stream doesn't need to exist; if it doesn't exist it will be created; if it exists, it will be truncated to the length of zero (erased); if the creation isn't possible (e.g., due to system permissions) the open() function raises an exception.

a open mode: append

    the stream will be opened in append mode;
    the file associated with the stream doesn't need to exist; if it doesn't exist, it will be created; if it exists the virtual recording head will be set at the end of the file (the previous content of the file remains untouched.)

r+ open mode: read and update

    the stream will be opened in read and update mode;
    the file associated with the stream must exist and has to be writeable, otherwise the open() function raises an exception;
    both read and write operations are allowed for the stream.

w+ open mode: write and update

    the stream will be opened in write and update mode;
    the file associated with the stream doesn't need to exist; if it doesn't exist, it will be created; if it exists, it will be truncated to the length of zero (erased), just as in the w mode;
    both read and write operations are allowed for the stream.
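The differences between these modes are easiest to see on a scratch file. The following sketch assumes a hypothetical file modes.txt in a temporary directory and a small helper, read_all(), introduced only for this demonstration.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()                 # scratch directory (illustrative)
path = os.path.join(tmp_dir, "modes.txt")    # hypothetical file name


def read_all(name):
    """Helper for this demo: return the whole content of a text file."""
    s = open(name, "r")
    data = s.read()
    s.close()
    return data


# w: creates the file, or truncates it if it already exists.
s = open(path, "w")
s.write("first")
s.close()
after_w = read_all(path)

# a: keeps the old content and writes at the end of the file.
s = open(path, "a")
s.write("+second")
s.close()
after_a = read_all(path)

# r+: the file must exist; writing starts at the beginning
# and overwrites the existing characters in place.
s = open(path, "r+")
s.write("FIRST")
s.close()
after_r_plus = read_all(path)

# w+: truncates first, then allows both writing and reading.
s = open(path, "w+")
s.write("new")
s.seek(0)                # move back to the start before reading
after_w_plus = s.read()
s.close()

# r: a missing file raises FileNotFoundError.
try:
    s = open(os.path.join(tmp_dir, "missing.txt"), "r")
    s.close()
    missing_raises = False
except FileNotFoundError:
    missing_raises = True

print(after_w, after_a, after_r_plus, after_w_plus, missing_raises)

shutil.rmtree(tmp_dir)
```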



Selecting text and binary modes

If there is a letter b at the end of the mode string it means that the stream is to be opened in the binary mode.

If the mode string ends with a letter t the stream is opened in the text mode.

Text mode is the default behaviour assumed when no binary/text mode specifier is used.

Finally, a successful open sets the current file position (the virtual reading/writing head) before the first byte of the file if the mode is not a, and after the last byte of the file if the mode is a.

    Text mode    Binary mode    Description
    rt           rb             read
    wt           wb             write
    at           ab             append
    r+t          r+b            read and update
    w+t          w+b            write and update
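A short sketch of what the text/binary choice and the initial file position look like in practice. The file name sample.bin is hypothetical, and tell() (which reports the current position) is used here only to make the position visible.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.bin")   # hypothetical file name

# Binary write: the stream accepts bytes, not str.
s = open(path, "wb")
s.write(b"abc")
s.close()

# Text mode: the position starts at 0 and read() returns a str.
t = open(path, "rt", encoding="ascii")
text_start = t.tell()
text_data = t.read()
t.close()

# Binary mode: read() returns a bytes object.
b = open(path, "rb")
binary_data = b.read()
b.close()

# Append mode: the position starts after the last byte of the file.
a = open(path, "ab")
append_start = a.tell()
a.close()

print(type(text_data).__name__, type(binary_data).__name__)
print(text_start, append_start)

shutil.rmtree(tmp_dir)
```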

EXTRA

You can also open a file for its exclusive creation. You can do this using the x open mode. If the file already exists, the open() function will raise an exception.
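A minimal sketch of exclusive creation, assuming a hypothetical file name fresh.txt in a temporary directory. The exception raised by the second attempt is FileExistsError, whose errno is errno.EEXIST.

```python
import errno
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "fresh.txt")    # hypothetical file name

# The first attempt succeeds because the file does not exist yet.
s = open(path, "x")
s.write("created")
s.close()

# The second attempt fails because the file is already there.
try:
    s = open(path, "x")
    s.close()
    second_attempt = "opened"
    same_errno = False
except FileExistsError as exc:
    second_attempt = "FileExistsError"
    same_errno = (exc.errno == errno.EEXIST)

print(second_attempt, same_errno)

shutil.rmtree(tmp_dir)
```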



### Opening the stream for the first time

Imagine that we want to develop a program that reads the content of a text file named C:\Users\User\Desktop\file.txt.

How do we open that file for reading? Here's the relevant snippet of the code:

In [None]:
try:
    stream = open("C:\\Users\\User\\Desktop\\file.txt", "rt")
    # Processing goes here.
    stream.close()
except Exception as exc:
    print("Cannot open the file:", exc)

#### What's going on here?

    we open the try-except block as we want to handle runtime errors softly;
    we use the open() function to try to open the specified file (note the way we've specified the file name)
    the open mode is defined as text to read (as text is the default setting, we can skip the t in mode string)
    in case of success we get an object from the open() function and we assign it to the stream variable;
    if open() fails, we handle the exception printing full error information (it's definitely good to know what exactly happened)
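A backslash inside an ordinary Python string starts an escape sequence, so Windows paths need some care. The sketch below shows three spellings of the same path that avoid the problem; none of them opens the file, they only build the strings.

```python
# Each backslash doubled, so that it is taken literally.
escaped = "C:\\Users\\User\\Desktop\\file.txt"

# A raw string: backslashes are not treated as escape characters.
raw = r"C:\Users\User\Desktop\file.txt"

# Forward slashes, which Windows also accepts as path separators.
forward = "C:/Users/User/Desktop/file.txt"

print(escaped == raw)   # both spell exactly the same characters
print(escaped)
print(forward)
```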

Pre-opened streams

We said earlier that any stream operation must be preceded by the open() function invocation. There are three well-defined exceptions to the rule.

When our program starts, three streams are already open and don't require any extra preparation. What's more, your program can use these streams explicitly if you take care to import the sys module:

In [None]:
import sys

because that's where the declaration of the three streams is placed.



The names of these streams are: sys.stdin, sys.stdout, and sys.stderr.

Let's analyze them:

    sys.stdin
        stdin (as standard input)
        the stdin stream is normally associated with the keyboard, pre-open for reading and regarded as the primary data source for the running programs;
        the well-known input() function reads data from stdin by default.

    sys.stdout
        stdout (as standard output)
        the stdout stream is normally associated with the screen, pre-open for writing, regarded as the primary target for outputting data by the running program;
        the well-known print() function outputs the data to the stdout stream.

    sys.stderr
        stderr (as standard error output)
        the stderr stream is normally associated with the screen, pre-open for writing, regarded as the primary place where the running program should send information on the errors encountered during its work;
        we haven't presented any method to send the data to this stream (we will do it soon, we promise)
        the separation of stdout (useful results produced by the program) from stderr (error messages, which are undeniably useful but are not results) makes it possible to redirect these two types of information to different targets. A more extensive discussion of this issue is beyond the scope of our course. Your operating system's handbook will provide more information on these issues.
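As a brief preview, here is one possible way to keep results and diagnostics apart. The function name report() and the sample messages are hypothetical; the sketch relies only on print() and its file argument.

```python
import sys


def report(result, problem):
    """Send the result to stdout and the diagnostic message to stderr."""
    print(result)                     # print() writes to sys.stdout by default
    print(problem, file=sys.stderr)   # file= sends the text to another stream


report("42", "warning: the input was truncated")
```

When such a program is run from a shell, the two streams can be redirected independently, for example by sending stdout to a file while the warnings stay on the screen.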


#### Let's take a look at some selected constants useful for detecting stream errors:

    errno.EACCES → Permission denied

    The error occurs when you try, for example, to open a file with the read only attribute for writing.

    errno.EBADF → Bad file number

    The error occurs when you try, for example, to operate with an unopened stream.

    errno.EEXIST → File exists

    The error occurs when you try, for example, to create a file that already exists (e.g., by opening an existing file in the x mode).

    errno.EFBIG → File too large

    The error occurs when you try to create a file that is larger than the maximum allowed by the operating system.

    errno.EISDIR → Is a directory

    The error occurs when you try to treat a directory name as the name of an ordinary file.

    errno.EMFILE → Too many open files

    The error occurs when you try to simultaneously open more streams than acceptable for your operating system.

    errno.ENOENT → No such file or directory

    The error occurs when you try to access a non-existent file/directory.

    errno.ENOSPC → No space left on device

    The error occurs when there is no free space on the media.

The complete list is much longer (it also includes some error codes not related to stream processing).
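To see one of these constants in action, the sketch below provokes an error on purpose and inspects the errno property of the exception. The missing file name is hypothetical; errno.errorcode is the standard dictionary that maps error numbers back to their symbolic names.

```python
import errno
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
missing = os.path.join(tmp_dir, "no_such_file.txt")   # deliberately absent

try:
    stream = open(missing, "rt")
    stream.close()
    code = None
except OSError as exc:        # IOError is an alias of OSError in Python 3
    code = exc.errno

is_enoent = (code == errno.ENOENT)
symbolic_name = errno.errorcode.get(code)   # number -> symbolic name

print(is_enoent, symbolic_name)

shutil.rmtree(tmp_dir)
```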

#### Diagnosing stream problems: continued

If you are a very careful programmer, you may feel the need to compare exc.errno against the individual errno constants, one branch per error, and print a dedicated message for each of them.

Fortunately, there is a function that can dramatically simplify the error handling code.

Its name is strerror(), and it comes from the os module and expects just one argument - an error number.

Its role is simple: you give an error number and get a string describing the meaning of the error.

Note: if you pass a non-existent error code (a number which is not bound to any actual error), the result depends on the platform: the function either raises a ValueError exception or returns a generic "Unknown error" message.
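A quick way to get a feel for strerror() is to print the messages for a few of the constants listed earlier. This is only a sketch; the exact wording of the messages comes from the operating system and may differ between platforms.

```python
import errno
from os import strerror

# Map a few symbolic names to the messages the platform provides for them.
messages = {}
for code in (errno.EACCES, errno.ENOENT, errno.ENOSPC):
    messages[errno.errorcode[code]] = strerror(code)

for name, text in messages.items():
    print(name, "->", text)
```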

Now we can simplify our code in the following way:

In [11]:
from os import strerror

try:
    s = open("c:/users/user/Desktop/file.txt", "rt")
    # Actual processing goes here.
    s.close()
except IOError as exc:
    print("The file could not be opened:", strerror(exc.errno))

The file could not be opened: No such file or directory


For comparison, this is the longer variant, which checks exc.errno against individual errno constants:

In [12]:
import errno

try:
    s = open("c:/users/user/Desktop/file.txt", "rt")
    # Actual processing goes here.
    s.close()
except IOError as exc:
    if exc.errno == errno.ENOENT:
        print("The file doesn't exist.")
    elif exc.errno == errno.EMFILE:
        print("You've opened too many files.")
    else:
        print("The error number is:", exc.errno)

The file doesn't exist.


Okay. Now it's time to deal with text files and get familiar with some basic techniques you can use to process them.


### Key takeaways

1. A file needs to be open before it can be processed by a program, and it should be closed when the processing is finished.

Opening the file associates it with the stream, which is an abstract representation of the physical data stored on the media. The way in which the stream is processed is called open mode. Three open modes exist:

    read mode – only read operations are allowed;
    write mode – only write operations are allowed;
    update mode – both writes and reads are allowed.


2. Depending on the physical file content, different Python classes can be used to process files. In general, BufferedIOBase is able to process any file, while TextIOBase is a specialized class dedicated to processing text files (i.e., files containing human-readable text divided into lines using new-line markers). Thus, the streams can be divided into binary and text ones.

3. The following open() function syntax is used to open a file:

open(file_name, mode=open_mode, encoding=text_encoding)

The invocation creates a stream object and associates it with the file named file_name, using the specified open_mode and setting the specified text_encoding, or it raises an exception in the case of an error.

4. Three predefined streams are already open when the program starts:

    sys.stdin – standard input;
    sys.stdout – standard output;
    sys.stderr – standard error output.


5. The IOError exception object (in Python 3, IOError is an alias of OSError), created when any file operation fails (including open operations), contains a property named errno, which holds the completion code of the failed action. Use this value to diagnose the problem.



#### Processing text files

In this lesson we're going to prepare a simple text file with some short, simple content.

We're going to show you some basic techniques you can utilize to read the file contents in order to process them.

The processing will be very simple - you're going to copy the file's contents to the console, and count all the characters the program has read in.

But remember - our understanding of a text file is very strict. In our sense, it's a plain text file - it may contain only text, without any additional decorations (formatting, different fonts, etc.).

That's why you should avoid creating the file using any advanced text processor like MS Word, LibreOffice Writer, or something like this. Use the very basics your OS offers: Notepad, vim, gedit, etc.

If your text files contain some national characters not covered by the standard ASCII charset, you may need an additional step. Your open() function invocation may require an argument denoting specific text encoding.

For example, if you're using a Unix/Linux OS configured to use UTF-8 as a system-wide setting, the open() function may look as follows:
stream = open('file.txt', 'rt', encoding='utf-8')


where the encoding argument has to be set to a value which is a string representing proper text encoding (UTF-8, here).

Consult your OS documentation to find an encoding name adequate to your environment.
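The sketch below illustrates why the encoding argument matters. It writes a short Polish sample sentence (chosen only as an example of national characters) as UTF-8, reads it back with the matching encoding, and then shows that decoding the same file as ASCII fails. The file name national.txt is hypothetical.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "national.txt")   # hypothetical file name
text = "Zażółć gęślą jaźń"                     # sample with non-ASCII letters

# Write and read back with the same, explicitly named encoding.
s = open(path, "wt", encoding="utf-8")
s.write(text)
s.close()

s = open(path, "rt", encoding="utf-8")
round_trip = s.read()
s.close()

# Reading with an encoding that cannot represent these characters fails.
s = open(path, "rt", encoding="ascii")
try:
    s.read()
    ascii_result = "decoded"
except UnicodeDecodeError:
    ascii_result = "UnicodeDecodeError"
finally:
    s.close()

print(round_trip == text, ascii_result)

shutil.rmtree(tmp_dir)
```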

Note

For the purposes of our experiments with file processing carried out in this section, we're going to use a pre-uploaded set of files (e.g., the tzop.txt or text.txt files) which you'll be able to work with. If you'd like to work with your own files locally on your machine, we strongly encourage you to do so, and to use IDLE (or any other IDE that you may prefer) to carry out your own tests.


In [2]:
# Opening tzop.txt in read mode, returning it as a file object:
stream = open("tzop.txt", "rt", encoding = "utf-8")

print(stream.read()) # printing the content of the file


FileNotFoundError: [Errno 2] No such file or directory: 'tzop.txt'

#### Processing text files: continued

Reading a text file's contents can be performed using several different methods - none of them is any better or worse than any other. It's up to you which of them you prefer and like.

Some of them will sometimes be handier, and sometimes more troublesome. Be flexible. Don't be afraid to change your preferences.

The most basic of these methods is the one offered by the read() function, which you were able to see in action in the previous lesson.

If applied to a text file, the function is able to:

    read a desired number of characters (including just one) from the file, and return them as a string;
    read all the file contents, and return them as a string;
    if there is nothing more to read (the virtual reading head reaches the end of the file), the function returns an empty string.
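The three behaviors listed above can be sketched on a one-line scratch file. The file name one_line.txt is hypothetical; the sample line is borrowed from the text used in this lesson.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "one_line.txt")   # hypothetical file name

s = open(path, "wt")
s.write("Beautiful is better than ugly.\n")
s.close()

s = open(path, "rt")
first = s.read(1)        # exactly one character
next_eight = s.read(8)   # the next eight characters
rest = s.read()          # everything that is left
at_end = s.read()        # nothing more to read: an empty string
s.close()

print(repr(first), repr(next_eight), repr(rest), repr(at_end))

shutil.rmtree(tmp_dir)
```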


We'll start with the simplest variant and use a file named text.txt. The file has the following contents:

In [None]:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.


Now look at the code below, and let's analyze it.

In [14]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    ch = s.read(1)
    while ch != '':
        print(ch, end='')
        cnt += 1
        ch = s.read(1)
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))


I/O error occurred:  No such file or directory


The routine is rather simple:

    use the try-except mechanism and open the file of the predetermined name (text.txt in our case)
    try to read the very first character from the file (ch = s.read(1))
    if you succeed (this is proven by a positive result of the while condition check), output the character (note the end= argument - it's important! You don't want to skip to a new line after every character!);
    update the counter (cnt), too;
    try to read the next character, and the process repeats.



#### Processing text files: continued

If you're absolutely sure that the file's length is safe and you can read the whole file to the memory at once, you can do it - the read() function, invoked without any arguments or with an argument that evaluates to None, will do the job for you.

Remember - reading a terabyte-long file using this method may exhaust the available memory and crash your program, or make the whole system unresponsive.

Don't expect miracles - computer memory isn't stretchable.

In [1]:
from os import strerror

try:
    cnt = 0
    s = open('text.txt', "rt")
    content = s.read()
    for ch in content:
        print(ch, end='')
        cnt += 1
    s.close()
    print("\n\nCharacters in file:", cnt)
except IOError as e:
    print("I/O error occurred: ", strerror(e.errno))
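A middle ground between reading one character at a time and reading the whole file at once is to read it in chunks of a bounded size. The following is a sketch under stated assumptions: the function name count_characters(), the default chunk size of 1024 characters, and the scratch file big.txt are all chosen for illustration.

```python
import os
import shutil
import tempfile


def count_characters(path, chunk_size=1024):
    """Count the characters in a text file, reading chunk_size at a time."""
    cnt = 0
    s = open(path, "rt", encoding="utf-8")
    try:
        chunk = s.read(chunk_size)
        while chunk != '':            # an empty string means end of file
            cnt += len(chunk)
            chunk = s.read(chunk_size)
    finally:
        s.close()                     # close the stream even if reading fails
    return cnt


# Usage on a scratch file made of 1000 four-character lines.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "big.txt")
s = open(sample_path, "wt", encoding="utf-8")
s.write("abc\n" * 1000)
s.close()

total = count_characters(sample_path, chunk_size=64)
print("Characters in file:", total)

shutil.rmtree(tmp_dir)
```

Memory use stays proportional to chunk_size rather than to the size of the file, at the cost of a slightly longer loop.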


### What is a bytearray?

Before we start talking about binary files, we have to tell you about one of the specialized classes Python uses to store amorphous data.

Amorphous data is data which has no specific shape or form - it is just a series of bytes.

This doesn't mean that these bytes cannot have their own meaning, or cannot represent any useful object, e.g., bitmap graphics.