# io module

The Input Output module ```io``` is used for reading and writing data to a file.

## Text Files

A text file called ```text.txt``` can be created in the same folder as the Interactive Python Notebook File:

```
Baa, baa, black sheep,
Have you any wool?
Yes, sir, yes, sir,
Three bags full.

One for my master,
One for my dame,
And one for the little boy
Who lives down the lane.
```

Text files can be viewed in Notepad++ with View → Show Symbol → Show All Characters:

<img src='./images/img_001.png' alt='img_001' width='500'/>

Notice that there is a ```CRLF``` at the end of each line, which instructs the editor to move onto the next line. ```CRLF``` stands for carriage return and line feed.

### open function

The ```open``` function in the ```io``` module is used for opening text and binary files. The module can be imported and the docstring viewed:

In [1]:
import io

In [2]:
io.open?

Signature:
io.open(
    file,
    mode='r',
    buffering=-1,
    encoding=None,
    errors=None,
    newline=None,
    closefd=True,
    opener=None,
)
Docstring:
Open file and return a stream.  Raise OSError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returne

Because this function is so commonly used, a copy of it is included in ```builtins```:

In [3]:
open?

Signature:
open(
    file,
    mode='r',
    buffering=-1,
    encoding=None,
    errors=None,
    newline=None,
    closefd=True,
    opener=None,
)
Docstring:
Open file and return a stream.  Raise OSError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returned I/O object is closed

The ```open``` function requires a ```file``` input argument. This can be given as just the file name when the file is in the same folder as the interactive Python notebook file (or Python script file); otherwise the path to the file is required.

The ```mode``` keyword input argument can be specified using a single letter:

|mode|definition|
|---|---|
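|'r'|open an existing file and read existing content (the default)|
|'w'|open a file for writing, creating it if it does not exist and discarding any existing content|
|'a'|open a file for writing, creating it if it does not exist and appending new content to the end|
|'x'|create a new file and write new content, failing if the file already exists|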
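The behaviour of the four modes can be checked with a short sketch. It assumes a throwaway temporary folder and the hypothetical file names ```demo.txt``` and ```missing.txt```, neither of which is used elsewhere in this notebook:

```python
import os
import tempfile

# Work in a throwaway folder so no real files are touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')  # hypothetical file name

    # 'x' creates the file and fails if it already exists.
    with open(path, mode='x', encoding='utf-8') as file:
        file.write('first')

    try:
        open(path, mode='x', encoding='utf-8')
        second_create_failed = False
    except FileExistsError:
        second_create_failed = True

    # 'a' keeps the existing content and adds to the end.
    with open(path, mode='a', encoding='utf-8') as file:
        file.write(' second')

    with open(path, mode='r', encoding='utf-8') as file:
        after_append = file.read()

    # 'w' discards the existing content before writing.
    with open(path, mode='w', encoding='utf-8') as file:
        file.write('third')

    with open(path, mode='r', encoding='utf-8') as file:
        after_write = file.read()

    # 'r' requires the file to exist.
    try:
        open(os.path.join(folder, 'missing.txt'), mode='r', encoding='utf-8')
        missing_read_failed = False
    except FileNotFoundError:
        missing_read_failed = True

print(second_create_failed)  # True
print(after_append)          # first second
print(after_write)           # third
print(missing_read_failed)   # True
```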

The ```encoding``` keyword argument is used to specify the encoding, which was discussed in detail when the ```bytes``` class was examined in a previous notebook. When ```encoding``` is left at its default value of ```None```, the encoding is taken from the locale of the operating system, so it is safer to specify ```'utf-8'``` explicitly. Data that was processed with a Microsoft product may require ```'utf-8-sig'``` in order to remove an unwanted BOM. To recap:

|encoding|bytes per character|bits per character|byte order|byte order marker BOM|
|---|---|---|---|---|
|'utf-8'|1, 2, 3, 4|8, 16, 24, 32|big endian| |
|'utf-8-sig'|1, 2, 3, 4|8, 16, 24, 32|big endian|efbbbf|
|'utf-32'|4|32|little endian|fffe0000|
|'utf-32-le'|4|32|little endian| |
|'utf-32-be'|4|32|big endian| |
|'utf-16'|2|16|little endian|fffe|
|'utf-16-le'|2|16|little endian| |
|'utf-16-be'|2|16|big endian| |
|'latin1'|1|8| ||
|'ascii'|1|8| ||
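The effect of the BOM can be seen in a small sketch. It assumes a temporary folder and a hypothetical file ```bom.txt``` whose bytes begin with the UTF-8 BOM ```efbbbf``` from the table above:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'bom.txt')  # hypothetical file name

    # Write the UTF-8 BOM followed by the text as raw bytes.
    with open(path, mode='wb') as file:
        file.write(b'\xef\xbb\xbfHello')

    # 'utf-8' keeps the BOM as an invisible leading character.
    with open(path, mode='r', encoding='utf-8') as file:
        with_bom = file.read()

    # 'utf-8-sig' recognises the BOM and removes it.
    with open(path, mode='r', encoding='utf-8-sig') as file:
        without_bom = file.read()

print(repr(with_bom))     # '\ufeffHello'
print(repr(without_bom))  # 'Hello'
```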

The ```newline``` keyword input argument controls how line endings are handled. It can be ```None```, ```''```, ```'\n'```, ```'\r'``` or ```'\r\n'```.

On Linux a line normally ends with just the new line escape character ```'\n'```.

On Windows a line normally ends with two escape characters, carriage return followed by new line, ```'\r\n'```.
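The following sketch compares two of the other ```newline``` settings when reading. It assumes a temporary folder and a hypothetical file ```crlf.txt``` written with Windows line endings:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'crlf.txt')  # hypothetical file name

    # Write Windows line endings as raw bytes so nothing is translated.
    with open(path, mode='wb') as file:
        file.write(b'one\r\ntwo\r\n')

    # newline=None (the default): every line ending is translated to '\n'.
    with open(path, mode='r', encoding='utf-8', newline=None) as file:
        translated = file.read()

    # newline='': line endings are recognised but returned unchanged.
    with open(path, mode='r', encoding='utf-8', newline='') as file:
        untranslated = file.read()

print(repr(translated))    # 'one\ntwo\n'
print(repr(untranslated))  # 'one\r\ntwo\r\n'
```

With the default, the same reading code gives the same strings on Linux and Windows, at the cost of hiding which line ending the file actually uses.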

The ```errors``` keyword input argument specifies how encoding and decoding errors are handled. The default value of ```None``` has the same effect as ```'strict'```, which raises an exception when an error occurs.
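As a rough illustration, the sketch below reads bytes that are valid ```'latin1'``` but not valid ```'utf-8'```. The temporary folder and the file name ```latin.txt``` are assumptions made for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'latin.txt')  # hypothetical file name

    # b'\xe9' is 'é' in latin1 but is not valid UTF-8 on its own.
    with open(path, mode='wb') as file:
        file.write(b'caf\xe9!')

    # errors='strict' raises an exception on the invalid byte.
    try:
        with open(path, mode='r', encoding='utf-8', errors='strict') as file:
            file.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors='replace' substitutes the replacement character U+FFFD.
    with open(path, mode='r', encoding='utf-8', errors='replace') as file:
        replaced = file.read()

    # Supplying the correct encoding avoids the error altogether.
    with open(path, mode='r', encoding='latin1', errors='strict') as file:
        correct = file.read()

print(strict_failed)   # True
print(repr(replaced))  # 'caf\ufffd!'
print(correct)         # café!
```

Using ```'replace'``` hides the problem rather than fixing it, so supplying the correct encoding is usually the better option.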

A file can be opened:

In [4]:
file = open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n')

In text mode, the ```open``` function returns an instance of the ```TextIOWrapper``` class, which is assigned here to the instance name ```file```. This can be seen when the datatype of ```file``` is checked:

In [5]:
type(file)

_io.TextIOWrapper

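The ```_io``` indicates that this class is implemented in ```_io```, the internal module that the ```io``` module is built on. The underscore prefix marks the module as internal, and the class is normally accessed as ```io.TextIOWrapper```. The ```open``` function shares the ```encoding```, ```errors``` and ```newline``` input arguments with the initialisation signature of this class: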

In [6]:
io.TextIOWrapper?

Init signature:
io.TextIOWrapper(
    buffer,
    encoding=None,
    errors=None,
    newline=None,
    line_buffering=False,
    write_through=False,
)
Docstring:
Character and line based layer over a BufferedIOBase object, buffer.

encoding gives the name of the encoding that the stream will be
decoded or encoded with. It defaults to locale.getencoding().

errors determines the strictness of encoding and decoding (see
help(codecs.Codec) or the documentation for codecs.register) and
defaults to "strict".

newline controls how line endings are handled. It can be None, '',
'\n', '\r', and '\
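The relationship between ```open```, ```io.open``` and ```TextIOWrapper``` can be explored with a small sketch. Here an in-memory ```io.BytesIO``` object is assumed as a stand-in for the ```buffer``` that ```open``` would normally create from a file on disk:

```python
import io

# The builtins open and io.open are the same function object.
same_function = io.open is open

# TextIOWrapper adds a text layer on top of a binary buffer.
# An in-memory BytesIO buffer stands in for a file on disk.
buffer = io.BytesIO('Baa, baa\r\nblack sheep'.encode('utf-8'))
wrapper = io.TextIOWrapper(buffer, encoding='utf-8', newline='\r\n')
lines = wrapper.readlines()
wrapper.close()  # closing the wrapper also closes the buffer

print(same_function)  # True
print(lines)          # ['Baa, baa\r\n', 'black sheep']
print(buffer.closed)  # True
```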

Data can be read from the ```TextIOWrapper``` instance, for example using the ```readlines``` method, which returns a list of strings:

In [7]:
file.readlines()

['Baa, baa, black sheep,\r\n',
 'Have you any wool?\r\n',
 'Yes, sir, yes, sir,\r\n',
 'Three bags full.\r\n',
 '\r\n',
 'One for my master,\r\n',
 'One for my dame,\r\n',
 'And one for the little boy\r\n',
 'Who lives down the lane.']

After working with a file, it should be closed. The file can be closed using the ```TextIOWrapper``` close method:

In [8]:
file.close()

The ```print_identifier_group``` function from the custom ```helper_module``` can be imported to view the identifiers in more detail:

In [9]:
from helper_module import print_identifier_group

The ```TextIOWrapper``` class has the standard ```object``` based datamodel attributes and identifiers seen before. There are some additions such as ```__enter__``` (*dunder enter*) and ```__exit__``` (*dunder exit*):

In [10]:
print_identifier_group(file, kind='datamodel_attribute')

['__dict__', '__doc__']


In [11]:
print_identifier_group(file, kind='datamodel_method')

['__class__', '__del__', '__delattr__', '__dir__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']


The ```TextIOWrapper``` class has a number of attributes. Most of these correspond to the input arguments provided when initialising the instance:

In [12]:
print_identifier_group(file, kind='attribute')

['_CHUNK_SIZE', '_finalizing', 'buffer', 'closed', 'encoding', 'errors', 'line_buffering', 'mode', 'name', 'newlines', 'write_through']


The ```TextIOWrapper``` method ```readable``` will check whether a file is readable returning a boolean. The method ```read``` will read the entire file as a Unicode string, the method ```readline``` will read an individual line as a string and then advance, while the method ```readlines``` will read every line returning a list of Unicode strings corresponding to each line.

The methods ```writable```, ```write``` and ```writelines``` are the write counterparts.

The methods ```seekable``` and ```seek``` relate to the cursor position.

In [13]:
print_identifier_group(file, kind='method')

['_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', 'close', 'detach', 'fileno', 'flush', 'isatty', 'read', 'readable', 'readline', 'readlines', 'reconfigure', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']


### with code block

The datamodel methods ```__enter__``` (*dunder enter*) and ```__exit__``` (*dunder exit*) are used by a ```with``` code block. The file is opened and ```__enter__``` is called when the code block begins, and ```__exit__``` closes the file when the code block is exited, even if an error occurs inside the block:

In [14]:
with open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    print(file.name)
    print(file.mode)
    print(file.encoding)
    print(file.errors)
    print('readable: ', file.readable())
    print('writeable: ', file.writable())
    print('seekable: ', file.seekable())

text.txt
r
utf-8
strict
readable:  True
writeable:  False
seekable:  True
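That the file is closed on exit can be checked with the ```closed``` attribute. The sketch below assumes a temporary folder and a hypothetical file ```demo.txt```, and also shows the roughly equivalent ```try``` and ```finally``` form:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')  # hypothetical file name

    with open(path, mode='w', encoding='utf-8') as file:
        file.write('Hello')

    # The file is open inside the block and closed once it is exited.
    with open(path, mode='r', encoding='utf-8') as file:
        closed_inside = file.closed
    closed_after = file.closed

    # The file is also closed when the block is left because of an error.
    try:
        with open(path, mode='r', encoding='utf-8') as file:
            raise RuntimeError('something went wrong')
    except RuntimeError:
        closed_after_error = file.closed

    # Roughly equivalent code without a with block.
    file = open(path, mode='r', encoding='utf-8')
    try:
        text = file.read()
    finally:
        file.close()

print(closed_inside)       # False
print(closed_after)        # True
print(closed_after_error)  # True
print(text)                # Hello
```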


### read

The ```TextIOWrapper``` method ```read``` can be used to read the entire contents of the text file as a single string. Notice that this includes the carriage return and new line escape characters:

In [15]:
with open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data1 = file.read()

In [16]:
data1

'Baa, baa, black sheep,\r\nHave you any wool?\r\nYes, sir, yes, sir,\r\nThree bags full.\r\n\r\nOne for my master,\r\nOne for my dame,\r\nAnd one for the little boy\r\nWho lives down the lane.'

When printed, this gives a similar display to the file opened in Notepad++:

In [17]:
print(data1)

Baa, baa, black sheep,
Have you any wool?
Yes, sir, yes, sir,
Three bags full.

One for my master,
One for my dame,
And one for the little boy
Who lives down the lane.


The ```TextIOWrapper``` method ```readline``` will only read a single line:

In [18]:
with open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data2 = file.readline()

Note that the end of this string includes the carriage return and new line escape characters:

In [19]:
data2

'Baa, baa, black sheep,\r\n'

These whitespace characters can be stripped using the string method ```strip```:

In [20]:
data2.strip()

'Baa, baa, black sheep,'

The ```TextIOWrapper``` method ```readlines``` will instead output a list of strings:

In [21]:
with open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data3 = file.readlines()

Notice the square brackets ```[]``` enclosing the list and comma delimiter between each line. Each line also includes the carriage return and newline character:

In [22]:
data3

['Baa, baa, black sheep,\r\n',
 'Have you any wool?\r\n',
 'Yes, sir, yes, sir,\r\n',
 'Three bags full.\r\n',
 '\r\n',
 'One for my master,\r\n',
 'One for my dame,\r\n',
 'And one for the little boy\r\n',
 'Who lives down the lane.']

These can be removed using a list comprehension:

In [23]:
data4 = [line.strip() for line in data3]

In [24]:
data4

['Baa, baa, black sheep,',
 'Have you any wool?',
 'Yes, sir, yes, sir,',
 'Three bags full.',
 '',
 'One for my master,',
 'One for my dame,',
 'And one for the little boy',
 'Who lives down the lane.']
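A ```TextIOWrapper``` instance can also be iterated over directly, one line at a time, because it has the ```__iter__``` and ```__next__``` datamodel methods listed earlier. The sketch below assumes a temporary folder and a hypothetical two-line file, and uses ```rstrip('\r\n')``` so that only the line ending is removed and any other whitespace is kept:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'rhyme.txt')  # hypothetical file name

    # Write two lines with Windows line endings; the first is indented.
    with open(path, mode='wb') as file:
        file.write(b'  Baa, baa, black sheep,\r\nHave you any wool?\r\n')

    # Iterating over the file yields one line at a time.
    with open(path, mode='r', encoding='utf-8', newline='\r\n') as file:
        stripped_end = [line.rstrip('\r\n') for line in file]

    # strip() with no arguments removes the leading spaces as well.
    with open(path, mode='r', encoding='utf-8', newline='\r\n') as file:
        stripped_both = [line.strip() for line in file]

print(stripped_end)   # ['  Baa, baa, black sheep,', 'Have you any wool?']
print(stripped_both)  # ['Baa, baa, black sheep,', 'Have you any wool?']
```

Iterating over the file avoids holding every line in memory at once, which can matter for large files.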

### seek

The length of the single string obtained using the ```TextIOWrapper``` method ```read``` is:

In [25]:
len(data1)

175

In [26]:
data1

'Baa, baa, black sheep,\r\nHave you any wool?\r\nYes, sir, yes, sir,\r\nThree bags full.\r\n\r\nOne for my master,\r\nOne for my dame,\r\nAnd one for the little boy\r\nWho lives down the lane.'

The characters from index 5 onwards (the 6th character and onwards) can be seen by slicing the string:

In [27]:
data1[5:]

'baa, black sheep,\r\nHave you any wool?\r\nYes, sir, yes, sir,\r\nThree bags full.\r\n\r\nOne for my master,\r\nOne for my dame,\r\nAnd one for the little boy\r\nWho lives down the lane.'

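The cursor can be placed at a zero-based offset in the file using the ```TextIOWrapper``` method ```seek```. Strictly, in text mode the offset should be ```0``` or a value returned by the ```tell``` method. A plain integer works here because every character in this file is encoded as a single byte, so the character index and the byte offset coincide: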

In [28]:
with open('text.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.seek(5)
    data5 = file.read()

This gives a similar result to slicing of the string as seen above:

In [29]:
data5

'baa, black sheep,\r\nHave you any wool?\r\nYes, sir, yes, sir,\r\nThree bags full.\r\n\r\nOne for my master,\r\nOne for my dame,\r\nAnd one for the little boy\r\nWho lives down the lane.'

And the length of the string is ```175 - 5```:

In [30]:
len(data5)

170
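The ```tell``` method, which appears in the method list above, returns the current cursor position in a form that can be passed back to ```seek```. The sketch below assumes a temporary folder and a hypothetical two-line file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'lines.txt')  # hypothetical file name

    with open(path, mode='wb') as file:
        file.write(b'first\r\nsecond\r\n')

    with open(path, mode='r', encoding='utf-8', newline='\r\n') as file:
        first = file.readline()
        position = file.tell()         # remember where the second line starts
        second = file.readline()

        file.seek(position)            # return to the remembered position
        second_again = file.readline()

        file.seek(0)                   # return to the start of the file
        first_again = file.readline()

print(repr(first))         # 'first\r\n'
print(repr(second_again))  # 'second\r\n'
print(repr(first_again))   # 'first\r\n'
```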

### write

The ```TextIOWrapper``` method ```write``` can be used to write text to a file. Note that this file is opened with ```mode='w'```.

On Linux the keyword input argument should be ```newline='\n'``` and the ```\n``` should be incorporated in a string for a new line.

On Windows the keyword input argument should be ```newline='\r\n'```; however, ```\n``` should still be used in the string for a new line. Each ```\n``` used in the ```write``` method will be converted into ```\r\n```.

In [31]:
with open('text2.txt', mode='w', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.write('Hello World!\nBye World!')

This file can then be read using the ```TextIOWrapper``` method ```read```, note that this file is opened with ```mode='r'``` and ```newline='\r\n'```:

In [32]:
with open('text2.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data6 = file.read()

In [33]:
data6

'Hello World!\r\nBye World!'

Be careful not to use ```\r\n``` within the ```TextIOWrapper``` method ```write```, as the ```\n``` will be converted into ```\r\n``` and the result will be ```\r\r\n```, which is not a valid line ending:

In [34]:
with open('text3.txt', mode='w', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.write('Hello World!\r\nBye World!')

with open('text3.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data7 = file.read()
    
data7

'Hello World!\r\r\nBye World!'

The method ```writelines``` can be used to write a list of strings to a file. Note once again that this file is opened with ```mode='w'```.

On Linux the keyword input argument should be ```newline='\n'``` and the ```\n``` should be at the end of each string.

On Windows the keyword input argument should be ```newline='\r\n'```; however, each string should still end with ```\n```. Each ```\n``` used in the ```writelines``` method will be converted into ```\r\n```.

In [35]:
lines = ['Hello World!', 'Bye World!']

with open('text4.txt', mode='w', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.writelines([line + '\n' for line in lines])

with open('text4.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data8 = file.read()
    
data8

'Hello World!\r\nBye World!\r\n'
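The ```writelines``` method does not add line endings itself, which is why ```'\n'``` is added to each string above. The sketch below, which assumes a temporary folder and a hypothetical file ```lines.txt```, shows what happens without them and one way of placing a line ending only between the strings using the string method ```join```:

```python
import os
import tempfile

lines = ['Hello World!', 'Bye World!']

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'lines.txt')  # hypothetical file name

    # Without '\n' the strings are written one after the other.
    with open(path, mode='w', encoding='utf-8', newline='\r\n') as file:
        file.writelines(lines)

    with open(path, mode='r', encoding='utf-8', newline='\r\n') as file:
        run_together = file.read()

    # join places '\n' between the strings but not after the last one.
    with open(path, mode='w', encoding='utf-8', newline='\r\n') as file:
        file.write('\n'.join(lines))

    with open(path, mode='r', encoding='utf-8', newline='\r\n') as file:
        joined = file.read()

print(repr(run_together))  # 'Hello World!Bye World!'
print(repr(joined))        # 'Hello World!\r\nBye World!'
```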

### append

When ```mode='w'``` any content in an existing file is removed:

In [36]:
with open('text4.txt', mode='w', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.write('J')

with open('text4.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data9 = file.read()
    
data9

'J'

When ```mode='a'``` any content in an existing file is kept and the new content is appended to the end:

In [37]:
with open('text2.txt', mode='a', encoding='utf-8', errors='strict', newline='\r\n') as file:
    file.write('Appended')

with open('text2.txt', mode='r', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data10 = file.read()
    
data10

'Hello World!\r\nBye World!Appended'

Three variations of ```mode``` were seen: ```'r'```, ```'w'``` and ```'a'```. These have the aliases ```'rt'```, ```'wt'``` and ```'at'```, where the ```t``` indicates that the file is opened in text mode. In text mode, data is read and written as a Unicode string, which is encoded using the ```encoding``` supplied (```'utf-8'``` in the examples above).

## Bin Files

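```'rb'```, ```'wb'``` and ```'ab'``` instead indicate that the file is opened in binary mode, where data is read and written as a byte string. The keyword input arguments ```encoding```, ```errors``` and ```newline``` are not supplied, as raw ```bytes``` are used. In binary mode, ```open``` returns a buffered binary object such as a ```BufferedReader``` or ```BufferedWriter``` rather than a ```TextIOWrapper```, and its ```read``` and ```write``` methods work with byte strings.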

In [38]:
with open('text5.bin', mode='wb') as file:
    file.write(b'\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21\x0d\x0a\x48\x65\x6c\x6c\x6f')

with open('text5.bin', mode='rb') as file:
    data11 = file.read()
    
data11

b'Hello World!\r\nHello'

The binary file can be read in text mode if the correct encoding is supplied:

In [39]:
with open('text5.bin', mode='rt', encoding='utf-8', errors='strict', newline='\r\n') as file:
    data12 = file.readlines()
    
data12

['Hello World!\r\n', 'Hello']
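The difference between the two modes can be checked by looking at the type of the object returned by ```open```. The sketch below assumes a temporary folder and a hypothetical file ```demo.bin```, and decodes the bytes manually with the ```bytes``` method ```decode```:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.bin')  # hypothetical file name

    with open(path, mode='wb') as file:
        writer_type = type(file).__name__
        file.write('Hello World!\r\n'.encode('utf-8'))

        # A Unicode string cannot be written in binary mode.
        try:
            file.write('text')
            str_write_failed = False
        except TypeError:
            str_write_failed = True

    with open(path, mode='rb') as file:
        reader_type = type(file).__name__
        raw = file.read()

# Decoding the bytes gives the same string as reading in text mode.
text = raw.decode('utf-8')

print(writer_type)       # BufferedWriter
print(reader_type)       # BufferedReader
print(str_write_failed)  # True
print(raw)               # b'Hello World!\r\n'
print(repr(text))        # 'Hello World!\r\n'
```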