#Reading and Writing Files

One of the main uses of Python is to process information. We can hold that information in the memory of a computer using Python's built-in data structures, such as lists, strings, or dictionaries. However, when the data grow large or must outlive the program, we have to store them in files on a permanent medium such as a disk, CD, or flash memory.

Files on a computer can be organized differently on disk (i.e., in different file systems such as NTFS, FAT32, or EXT2) by different operating systems (Linux, macOS, Windows). Information in files can also be organized in different formats, such as text, Excel, or database files. Accordingly, we process different kinds of files using different modules and functions in Python.

We will learn how to read and write files in this chapter.


##Text Files and Their Format

Using a text editor such as Notepad, TextEdit, or vi, you can create, view, and save data in a text file. Your Python programs can also output data to a text file, a procedure explained later in this section. The data in a text file can be viewed as characters, words, numbers, or lines of text, depending on the text file’s format and on the purposes for which the data are used. When the data are treated as numbers (either integers or floating-point values), they must be separated by whitespace characters (spaces, tabs, and newlines). All data output to or input from a text file must be strings. Thus, numbers must be converted to strings before output (write), and these strings must be converted back to numbers after input (read).

###Writing Text to a File
Data can be output to a text file using a file object. Python’s open function, which expects a file pathname and a mode string as arguments, opens a connection to the file on disk and returns a file object. If a path is not specified, then the current path is assumed. The mode string is 'r' for input files and 'w' for output files. Thus, the following code opens a file object (e.g., f) on a file named myfile.txt for output:

In [1]:
f = open('myfile.txt','w')

If the file does not exist, it is created with the given pathname. If the file already exists, Python opens it and immediately erases any data it previously contained. String data are written (or output) to a file using the write method of the file object. The write method expects a single string argument and returns the number of characters written. If you want the output text to end with a newline, you must include the escape sequence \n in the string. The next statement writes two lines of text to the file:

In [2]:
f.write("First line.\nSecond line.\n")

25

When all of the outputs are finished, the file should be closed using the method close, as follows:

In [3]:
f.close()

Failure to close an output file can result in data being lost.
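
One common way to make sure an output file is always closed is Python's with statement. The following is a minimal sketch, not the only way to do it; the temporary directory is an assumption made here so that the example does not overwrite any real file.

```python
import os
import tempfile

# Assumption: write into a fresh temporary directory instead of the
# current working directory, so no existing file is touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "myfile.txt")

# The file is closed automatically when the with block ends,
# even if an error occurs inside the block.
with open(path, "w") as f:
    f.write("First line.\nSecond line.\n")

print(f.closed)  # True
```

With this form there is no separate call to close() to forget.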

###Writing Numbers to a File

The file method write() expects a string as an argument. Therefore, other types of data, such as integers or floating-point numbers, must first be converted to strings before being written to an output file. In Python, the values of most data types can be converted to strings by using the str() function. The resulting strings are then written to a file with a space or a newline as a separator character.  

The next code segment illustrates the output of integers to a text file. One hundred random integers between 1 and 100 (inclusive) are generated and written to a text file named integers.txt. The newline character is the separator.

In [4]:
import random
f = open("integers.txt", 'w')
for count in range(100):
    number = random.randint(1, 100)
    f.write(str(number) + "\n")
f.close()
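
The same idea works with a space as the separator. The sketch below is only an illustration: the file name integers_spaces.txt and the temporary directory are hypothetical choices, and ten numbers are used to keep the line short.

```python
import os
import random
import tempfile

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "integers_spaces.txt")

numbers = [random.randint(1, 100) for _ in range(10)]

with open(path, "w") as f:
    # Convert each integer to a string, then join the strings with spaces.
    f.write(" ".join(str(n) for n in numbers) + "\n")
```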

###Reading Text from a File

You open a file for input in a manner similar to opening a file for output. The only thing that changes is the mode string, which, in the case of opening a file for input (read), is 'r'. However, if no file exists at the given pathname, Python raises an error (a FileNotFoundError). The example shown below is the code for opening myfile.txt for input.

There are several ways to read data from an input file. The simplest way is to use the file method **read()** to input the **entire** contents of the file as a single string. If the file contains multiple lines of text, the newline characters will be embedded in this string.

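After read() has consumed the file, another call to read() returns an empty string, indicating that the end of the file has been reached. To read the contents again, the file can be re-opened. Python does not force you to close an input file, but closing it when you are done is good practice because it releases the underlying system resource.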

In [5]:
f = open("myfile.txt", 'r')
text = f.read()
print(text)

First line.
Second line.
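
The end-of-file behavior of read() can be checked with a short sketch. It also shows seek(0), a standard file method that moves back to the start of the file and is an alternative to re-opening. The temporary file path is an assumption made for the example.

```python
import os
import tempfile

# Assumption: a scratch copy of myfile.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as f:
    f.write("First line.\nSecond line.\n")

with open(path, "r") as f:
    first = f.read()    # the whole file as one string
    second = f.read()   # already at the end of the file
    f.seek(0)           # move back to the start of the file
    third = f.read()    # the whole file again

print(repr(second))     # ''
print(first == third)   # True
```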



Note the difference between read() and readlines(). read() returns the complete contents of the file as a single string; when printed, this string may span several lines because of the embedded "\n" characters. readlines() returns a list in which each item corresponds to one line of the file.

In [None]:
f = open("myfile.txt", 'r')
text = f.readlines()
print(text)

['First line.\n', 'Second line.\n']


An application might read and process the text one line at a time. A for loop accomplishes this nicely. The *for* loop views a file object as a sequence of lines of text. On each pass through the loop, the loop variable is bound to the next line of text in the sequence. Here is a session that opens the integers.txt file written earlier and visits the lines of text in it:

In [None]:
f = open("integers.txt", 'r')
for line in f:
    print(line)

47

5

62

89

21

8

84

65

95

63

10

67

61

31

97

3

30

16

21

39

98

61

34

95

73

82

24

5

7

14

19

66

26

66

61

43

60

55

58

56

37

49

5

31

12

91

89

11

27

99

63

90

91

39

42

11

74

79

2

39

50

94

18

76

19

62

75

18

38

46

2

54

6

59

2

21

65

22

76

3

10

90

97

83

72

4

6

5

62

95

20

71

93

42

62

62

87

42

98

85



Note that print appears to output an extra newline. This is because each line of text input from the file retains its newline character.
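
Two possible ways to avoid the doubled newlines are sketched below. io.StringIO is used here only as a stand-in for an open text file, so that the example needs no file on disk; the three numbers are sample data.

```python
import io

# Stand-in for an open text file (assumption for this sketch).
f = io.StringIO("47\n5\n62\n")

# Option 1: tell print not to add its own newline.
for line in f:
    print(line, end="")

# Option 2: remove the trailing newline from each line.
f.seek(0)
stripped = [line.rstrip("\n") for line in f]
print(stripped)  # ['47', '5', '62']
```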

In cases where you might want to read a specified number of lines from a file (say, the first line only), you can use the file method readline(). The readline() method consumes a line of input and returns this string, including the newline. If readline() encounters the end of the file, it returns the empty string. The next code segment uses a while True loop to input all of the lines of text with readline():

In [None]:
f = open("myfile.txt", 'r')
while True:
    line = f.readline()
    if line == "":
        break
    print(line)

First line.

Second line.
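
Reading only the first line, as mentioned above, might look like the following sketch. Again io.StringIO stands in for an open text file, and the variable names are illustrative.

```python
import io

# Stand-in for an open text file (assumption for this sketch).
f = io.StringIO("First line.\nSecond line.\n")

header = f.readline()      # consumes only the first line
rest = f.read()            # everything that has not been read yet
after_end = f.readline()   # at the end of the file, returns ''

print(repr(header))        # 'First line.\n'
print(repr(after_end))     # ''
```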



###Reading Numbers from a File

All of the file input operations return data to the program as strings. If these strings represent other types of data, such as integers or floating-point numbers, the programmer must convert them to the appropriate types before manipulating them further. In Python, the string representations of integers and floating-point numbers can be converted to the numbers themselves by using the functions int() and float(), respectively.
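
A small sketch of these conversions follows. The sample strings are made up for illustration; the last part shows that int() raises a ValueError when the text is not a valid integer.

```python
text_int = "42\n"      # sample text, as it might arrive from a file
text_float = " 3.5 "

n = int(text_int.strip())
x = float(text_float.strip())

print(n + 1)   # 43
print(x * 2)   # 7.0

# Text that does not represent an integer cannot be converted.
try:
    int("forty-two")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)  # True
```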

When reading data from a file, another important consideration is the format of the data items in the file. Earlier, we showed an example code segment that output integers separated by newlines to a text file. During input, these data can be read with a simple for loop. This loop accesses a line of text on each pass. To convert this line to the integer contained in it, the programmer runs the string method strip() to remove the newline and then runs the int() function to obtain the integer value. The following code segment illustrates this technique. It opens the file of random integers written earlier, reads them, and prints their sum.

In [None]:
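f = open("integers.txt", 'r')
total = 0  # named total so the built-in sum() function is not shadowed
for line in f:
    line = line.strip()   # remove the trailing newline
    number = int(line)
    total += number
f.close()
print("The sum is", total)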

The sum is 4890


Obtaining numbers from a text file in which they are separated by spaces is a bit trickier. One method proceeds by reading lines in a for loop, as before. But each line now can contain several integers separated by spaces. You can use the string method split() to obtain a list of the strings representing these integers, and then process each string in this list with another for loop. The next code segment modifies the previous one to handle integers separated by spaces and/or newlines. A file in this format might contain, for example:

25 52 36 

12 47 99 56

23

56

78 98

In [None]:
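f = open("integers.txt", 'r')
total = 0  # named total so the built-in sum() function is not shadowed
for line in f:
    wordlist = line.split()   # split on any run of whitespace
    for word in wordlist:
        number = int(word)
        total += number
f.close()
print("The sum is", total)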

The sum is 4890


Note that the line does not have to be stripped of the newline, because split() takes care of that automatically.
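
The effect of split() on a single line can be seen in isolation. The line below is taken from the sample data shown earlier; the variable names are illustrative.

```python
line = "12 47 99 56\n"

# With no argument, split() separates on any run of whitespace
# (spaces, tabs, newlines) and drops the trailing newline.
words = line.split()
numbers = [int(word) for word in words]

print(words)         # ['12', '47', '99', '56']
print(sum(numbers))  # 214
```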

| METHOD | WHAT IT DOES |
| --- | --- |
| open(pathname, mode) | Opens a file at the given pathname and returns a file object. The mode can be 'r' (read), 'w' (write, erasing any existing contents), 'a' (append), or 'r+' (read and write). |
| f.close() | Closes the file. Essential for output files, so that buffered data reach the disk; good practice for input files as well. |
| f.write(aString) | Outputs aString to a file. |
| f.read() | Inputs the contents of a file and returns them as a single string. Returns '' if the end of file is reached. |
| f.readline() | Inputs a line of text and returns it as a string, including the newline. Returns '' if the end of file is reached. |
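
The difference between the 'w' and 'a' modes can be sketched as follows. The file name log.txt and the temporary directory are hypothetical choices for this example.

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "log.txt")

with open(path, "w") as f:      # creates the file
    f.write("first\n")

with open(path, "a") as f:      # appends; existing data are kept
    f.write("second\n")

with open(path, "r") as f:
    after_append = f.read()

with open(path, "w") as f:      # opening with 'w' again erases old data
    f.write("third\n")

with open(path, "r") as f:
    after_rewrite = f.read()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```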

##Accessing and Manipulating Files and Directories on Disk

When designing Python programs that interact with files, it’s a good idea to include error recovery. For example, before attempting to open a file for input, the programmer should check to see if a file with the given pathname exists on the disk. The tables below explain some file system functions, including a function (os.path.exists()) that supports this checking. They also list some functions that allow your programs to navigate to a given directory in the file system, as well as perform some disk housekeeping. The functions listed in the tables below are largely self-explanatory, and you are encouraged to experiment. For example, the following code segment prints the names of all entries in the current working directory, creates a folder named testFolder if it does not already exist, and then prints the listing again:


In [None]:
import os

currentDirectoryPath = os.getcwd()
print(currentDirectoryPath)
listOfFileNames = os.listdir(currentDirectoryPath)
for name in listOfFileNames:
    print(name)

if not os.path.exists("testFolder"):
    os.mkdir("testFolder")

listOfFileNames = os.listdir(currentDirectoryPath)
for name in listOfFileNames:
    print(name)



/content/drive/My Drive
LTE-4G-Suh-Tang
Test
Guang
SPDY-ISU
Learning Recommender System
IoT
NSF-IUSE-2016
LRS – EISL – NSF 2016
IT377 for TAB Review
Textbook-Writing
Study Notes
Colab Notebooks
data
fastAI
ML4IoT
ML
ISU
Unsorted
IWS-Xylem-11082019.pdf
IWS-Xylem-11082019.pptx
MLwBlockchain
IT170
LTE-4G-Suh-Tang
Test
Guang
SPDY-ISU
Learning Recommender System
IoT
NSF-IUSE-2016
LRS – EISL – NSF 2016
IT377 for TAB Review
Textbook-Writing
Study Notes
Colab Notebooks
data
fastAI
ML4IoT
ML
ISU
Unsorted
IWS-Xylem-11082019.pdf
IWS-Xylem-11082019.pptx
MLwBlockchain
IT170
testFolder


We can now remove the test folder created above.

In [None]:
if os.path.exists("testFolder"):
    os.rmdir("testFolder")

| os MODULE FUNCTION | WHAT IT DOES |
| --- | --- |
| chdir(path) | Changes the current working directory to path. |
| getcwd() | Returns the path of the current working directory. |
| listdir(path) | Returns a list of the names of the entries in the directory named path. |
| mkdir(path) | Creates a new directory named path. A relative path is resolved against the current working directory. |
| remove(path) | Removes the file named path. |
| rename(old, new) | Renames the file or directory named old to new. |
| rmdir(path) | Removes the directory named path, which must be empty. |
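
The following sketch exercises several of the functions in the table above. It works inside a temporary directory (an assumption made here so that nothing in your own folders is changed), and the file names old.txt and new.txt are hypothetical.

```python
import os
import tempfile

base = tempfile.mkdtemp()                  # scratch area for the example
folder = os.path.join(base, "testFolder")
os.mkdir(folder)

old_name = os.path.join(folder, "old.txt")
new_name = os.path.join(folder, "new.txt")
with open(old_name, "w") as f:
    f.write("data\n")

os.rename(old_name, new_name)
names_after_rename = os.listdir(folder)
print(names_after_rename)                  # ['new.txt']

os.remove(new_name)                        # delete the file first ...
os.rmdir(folder)                           # ... rmdir needs an empty directory
remaining = os.listdir(base)
print(remaining)                           # []
```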

| os.path MODULE FUNCTION | WHAT IT DOES |
| --- | --- |
| exists(path) | Returns True if path exists and False otherwise. |
| isdir(path) | Returns True if path names a directory and False otherwise. |
| isfile(path) | Returns True if path names a file and False otherwise. |
| getsize(path) | Returns the size of the object named by path in bytes. |
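
Checking for a file before opening it, as recommended at the start of this section, could look like the sketch below. The helper read_if_present and the file names are hypothetical, introduced only for illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                        # scratch directory
present = os.path.join(folder, "myfile.txt")
missing = os.path.join(folder, "no_such_file.txt")

with open(present, "w") as f:
    f.write("First line.\n")

def read_if_present(path):
    """Return the text of the file at path, or None if there is no such file."""
    if not os.path.isfile(path):
        return None
    with open(path, "r") as f:
        return f.read()

print(read_if_present(present))
print(read_if_present(missing))   # None
print(os.path.isdir(folder))      # True
```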

##Mount the Google Drive to Google Colab

* Step 1: Open your Google Colab notebook and run the code below:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
os.chdir("/content/drive/My Drive/")
currentDirectoryPath = os.getcwd()
print(currentDirectoryPath)


/content/drive/My Drive


You can try to create a folder called "IT170test", and then go to Google Drive to verify that the folder was created.

In [None]:
if not os.path.exists("IT170test"):
    os.mkdir("IT170test")

Then, you can remove it, and go to Google Drive to verify that the folder is gone.

In [None]:
if os.path.exists("IT170test"):
    os.rmdir("IT170test")

##Data Serialization

For various applications, we may use different data structures (e.g., dictionaries, DataFrames) to store information. In some cases, we might want to save them to a file so that we can send them to other applications or use them later on. In Python, we can easily implement such functionality using a module called **pickle**. Technically, everything in Python (including all data structures) is an object. The process of converting a Python object into a form that can be moved elsewhere (e.g., sent over a network or saved to a file) is called object serialization.

###What is pickle?
Pickle is a Python module used for serializing and de-serializing Python object structures; serialization is also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be retrieved and de-serialized back to a Python object.
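
The round trip from object to byte stream and back can be sketched in memory with pickle.dumps() and pickle.loads(), without involving a file. The dictionary is sample data.

```python
import pickle

record = {'Alice': 90, 'Bob': 88, 'Cindy': 98}

data = pickle.dumps(record)      # serialize to a bytes object
print(type(data))                # <class 'bytes'>

restored = pickle.loads(data)    # de-serialize back to a Python object
print(restored == record)        # True: same contents
print(restored is record)        # False: it is a new object
```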

Do not confuse pickling with compression. The former is the conversion of an object from one representation (data in Random Access Memory (RAM)) to another (a byte stream on disk), while the latter is the process of encoding data with fewer bits in order to save disk space.

###What Can You Do With pickle?
Pickling is useful for applications where you need data to persist. For example, your program's state data can be saved to disk, so you can continue working on it later on. It can also be used to send data over a network using some transmission protocol such as Transmission Control Protocol (TCP), or to store Python objects in a database.

Pickle is also very useful when you are working with machine learning algorithms: you can save a trained model and use it to make new predictions at a later time, without having to rewrite everything or train the model all over again.

Objects serialized with pickle are not guaranteed to be readable from a different programming language. The same holds for different versions of Python itself: unpickling a file that was pickled in a different version of Python may not always work properly. You should also never unpickle data from an untrusted source, because malicious code inside the file might be executed upon unpickling.

##Storing data with pickle

###What can be pickled?

We can pickle objects with the following common data types:
* Booleans
* Integers
* Floats
* Strings
* Tuples
* Lists
* Dictionaries (that contain picklable objects)

All of the above can be pickled, and so can classes and functions, provided they are defined at the top level of a module. Not everything can be pickled easily, however. Examples include lambda functions and classes defined inside a function.
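
The sketch below contrasts a function defined at the top level of a module (math.sqrt is used as the example) with a lambda function. The exact exception type raised for the lambda can vary with the Python version, so the code catches both likely types.

```python
import math
import pickle

# A function defined at the top level of a module is pickled by reference.
sqrt_copy = pickle.loads(pickle.dumps(math.sqrt))
print(sqrt_copy(16.0))   # 4.0

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x * x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(lambda_pickled)    # False
```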

###Pickle vs JSON

JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format that is easily readable by humans. Although it was derived from JavaScript, JSON is standardized and language-independent, which is a significant advantage over pickle. It is also more secure, because loading JSON does not execute arbitrary code. Relative speed depends on the data and on the Python version.

However, if you only need to use Python, then the pickle module is still a good choice for its ease of use and ability to reconstruct complete Python objects.
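
For comparison, here is a minimal sketch of the same student record serialized with Python's standard json module. It also shows one limitation: JSON has no tuple type, so a tuple comes back as a list.

```python
import json

record = {'Alice': 90, 'Bob': 88, 'Cindy': 98}

text = json.dumps(record)     # a human-readable string, not bytes
print(text)                   # {"Alice": 90, "Bob": 88, "Cindy": 98}

restored = json.loads(text)
print(restored == record)     # True

# JSON cannot represent every Python type exactly.
pair = json.loads(json.dumps((1, 2)))
print(pair)                   # [1, 2]
```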

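In Python 2, an alternative was cPickle, a module nearly identical to pickle but written in C, which made it up to 1000 times faster. In Python 3 there is no separate cPickle module; importing pickle uses the C implementation automatically when it is available. For small files, you would not notice the difference in speed in any case. Both implementations produce the same data streams, so files written by one can be read by the other.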

###Pickling/unpickling files

In the following example, we will be pickling a simple student record dictionary. We will save it to a file and then load again. 

To open the file for writing, simply use the open() function. The first argument should be the name of your file. The second argument is 'wb'. The **w** means that you'll be writing to the file, and **b** refers to binary mode, which means that the data will be written as bytes objects. If you forget the b, a TypeError (write() argument must be str, not bytes) is raised. Please note that a single character in a string may need one to four bytes to be represented (in the UTF-8 encoding, for example). Similarly, if you omit the **b** from 'rb' when opening the file for reading, you will typically see an error, because some bytes in the pickled data are not valid in the text encoding and thus cannot be decoded properly.



In [None]:
import pickle

stuRecord_dict = {'Alice': 90, 'Bob' : 88, 'Cindy':98}

print('The original data is:')
print(stuRecord_dict)

filename = 'stuRecord'
outputFile = open(filename,'wb')

pickle.dump(stuRecord_dict, outputFile)
outputFile.close()

infile = open(filename,'rb')
new_dict = pickle.load(infile)
infile.close()

print('The unpickled data is:')
print(new_dict)

The original data is:
{'Alice': 90, 'Bob': 88, 'Cindy': 98}
The unpickled data is:
{'Alice': 90, 'Bob': 88, 'Cindy': 98}
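
What happens when the b is left out of the mode string can be sketched as follows. The temporary path is an assumption made so that the example does not overwrite the stuRecord file created above.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "stuRecord")   # scratch copy
record = {'Alice': 90, 'Bob': 88, 'Cindy': 98}

# Text mode ('w') cannot accept the bytes that pickle produces.
try:
    with open(path, "w") as text_file:
        pickle.dump(record, text_file)
    error_name = None
except TypeError as err:
    error_name = type(err).__name__
    print("TypeError:", err)

# Binary mode works, here combined with the with statement.
with open(path, "wb") as f:
    pickle.dump(record, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)   # True
```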


##Shelve Module

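The pickle module serializes objects as a single byte stream in a file; when several objects are dumped to the same file, they are stored one after another and must be loaded back in the same order.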
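
Storing several objects with pickle alone might look like the sketch below (the file name and temporary directory are hypothetical). Each call to pickle.load() returns the next object in the order it was dumped, which is the bookkeeping that key-based access avoids.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "several_objects")  # hypothetical name

stu_list = ['Alice', 'Bob', 'Cindy']
ex1_list = [80, 70, 90]

with open(path, "wb") as f:
    pickle.dump(stu_list, f)    # first object in the stream
    pickle.dump(ex1_list, f)    # second object in the stream

with open(path, "rb") as f:
    first = pickle.load(f)      # objects come back in dump order
    second = pickle.load(f)

print(first)    # ['Alice', 'Bob', 'Cindy']
print(second)   # [80, 70, 90]
```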

Shelve is a Python module built on top of pickle. It implements a persistent, dictionary-like object in which each pickled object is associated with a key (a string). You can load your shelved data file and access your pickled objects via their keys. This can be more convenient when you need to serialize multiple objects. The shelve module can be used as a simple persistent storage option for Python objects when a relational database is overkill. The shelf is accessed by keys, just as with a dictionary.

###Using shelve

In the following example, we use shelve.open() to open a file that accepts dictionary-style (key: value) data, where each key is chosen by the user to refer to a pickled object. Internally, shelve uses pickle to serialize the objects. In this example, we create four objects (three lists and one dictionary) called stu_list, ex1_list, ex2_list, and stuRecord_dict. Then, we store the four objects in the shelf file (called shelfFile), each under a different key, namely stu, ex1, ex2, and stuRec. Later on, we can retrieve a specific object with the corresponding key.



In [None]:
import shelve

shelfFile = shelve.open('mydata')

print(type(shelfFile))

stu_list = ['Alice', 'Bob', 'Cindy']
ex1_list = [80, 70, 90]
ex2_list = [90, 80, 100]
stuRecord_dict = {'Alice': 90, 'Bob' : 88, 'Cindy':98} 

shelfFile['stu'] = stu_list
shelfFile['ex1'] = ex1_list
shelfFile['ex2'] = ex2_list
shelfFile['stuRec'] = stuRecord_dict

shelfFile.close()

shelfFile = shelve.open('mydata')

print(shelfFile['stu'])
print(shelfFile['ex1'])
print(shelfFile['ex2'])
print(shelfFile['stuRec'])

shelfFile.close()



<class 'shelve.DbfilenameShelf'>
['Alice', 'Bob', 'Cindy']
[80, 70, 90]
[90, 80, 100]
{'Alice': 90, 'Bob': 88, 'Cindy': 98}
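
A shelf can also be used with the with statement, which closes it automatically, and it supports dictionary operations such as listing keys and membership tests. The following is a minimal sketch that writes to a temporary directory (an assumption, so that the mydata file above is left alone).

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mydata")   # scratch shelf file

# Store two objects; the shelf is closed when the with block ends.
with shelve.open(path) as shelf:
    shelf['stu'] = ['Alice', 'Bob', 'Cindy']
    shelf['ex1'] = [80, 70, 90]

# Re-open the shelf and inspect it like a dictionary.
with shelve.open(path) as shelf:
    keys = sorted(shelf.keys())
    has_ex2 = 'ex2' in shelf
    ex1 = shelf['ex1']

print(keys)      # ['ex1', 'stu']
print(has_ex2)   # False
print(ex1)       # [80, 70, 90]
```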


In [None]:
s1 = 'Tom'
s2 = '10.1.1.1'
':'.join([s1, s2])

'Tom:10.1.1.1'

In [None]:
'#'.join(['cats', 'rats', 'bats'])

'cats#rats#bats'