
### What We'll Learn

1. File I/O
2. Structured Text Files
3. Systems
4. Concurrency

## File I/O

When your program is running, all the data it generates is stored in RAM. RAM is fast but has two limitations:

1. It is expensive (and thus small in capacity)
2. It needs a constant power supply

Disk drives are slower than RAM but are much cheaper and, more importantly, retain data even when the power is off. To make our data persistent, we need to store it on a disk drive as *files*.

The simplest kind of persistence is a plain old file, aka a **flat file**. It's just *a sequence of bytes stored under a filename*. You can **read** a file into memory and **write** from memory to a file (on a disk drive).

### Open a File

Before reading or writing a file, you'll have to open it first.

```python
fileobj = open(filename, mode)
```

**mode** is a string indicating the file's type and what you want to do with it.

The first letter of mode indicates the operation:

- `r`: read
- `w`: write (if the file doesn't exist, create it; if it exists, overwrite it)
- `x`: write, but only if the file doesn't already exist (otherwise `FileExistsError` is raised)
- `a`: append (write after the end if the file exists; create it otherwise)

The second letter of mode indicates the file's type:

- `t (or nothing)`: plain text
- `b`: binary
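
The two letters combine into a single mode string. As a rough sketch of how the modes above behave (the file name `demo.txt` and the temporary directory are made up for illustration, and the `with` statement used here is covered later in this section):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'demo.txt')  # hypothetical file name

    # 'wt': create (or overwrite) a text file
    with open(path, 'wt') as f:
        f.write('one\n')

    # 'x': refuses to touch a file that already exists
    try:
        open(path, 'x')
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    # 'a': keeps the existing content and writes after it
    with open(path, 'a') as f:
        f.write('two\n')

    # 'rt' returns str, 'rb' returns the raw bytes
    with open(path, 'rt') as f:
        as_text = f.read()
    with open(path, 'rb') as f:
        as_bytes = f.read()

print(exclusive_failed)
print(repr(as_text))
print(type(as_bytes).__name__)
```

The same file yields a `str` in text mode and a `bytes` object in binary mode; only the mode string changes.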

### Write Text File

In [3]:
text = '''\
First line
Second line
Third line
End
'''

len(text)

38

In [4]:
fout = open('tmp1', 'wt')
ret = fout.write(text)
print('return value of write() =', ret)
fout.close()

return value of write() = 38


In [5]:
fout = open('tmp2', 'w')
print(text, file=fout)
fout.close()

> `write()` won't add any spaces or newlines, as `print()` does. However, `print()` also gives you the possibility to customize its behavior.

In [6]:
print('a', 'b', 'c')
print('d', 'e', 'f')

a b c
d e f


In [7]:
print('a', 'b', 'c', sep='', end=' ')
print('d', 'e', 'f', sep=',')

abc d,e,f


In [8]:
import this

In [19]:
zen_of_py = '''\
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
'''

In [20]:
fout = open('zen_of_py', 'w')
size = len(zen_of_py)
offset = 0
chunk = 100

while offset < size:
    fout.write(zen_of_py[offset : offset+chunk])
    offset += chunk

fout.close()

In [14]:
fout = open('zen_of_py', 'x')  # raises FileExistsError because zen_of_py already exists

In [15]:
fout = open('tmp1', 'a')
fout.write('A new line after end\n')
fout.close()

### Read Text File

In [21]:
fin = open('tmp1', 'rt')
text = fin.read()
fin.close()
print(text)

First line
Second line
Third line
End
A new line after end



> Be careful when calling `read()` with no arguments on large files. A gigabyte file will consume a gigabyte of memory.

In [22]:
zen_of_py = ''
fin = open('zen_of_py', 'r')
chunk = 100

while True:
    fragment = fin.read(chunk)
    if not fragment:
        break
    zen_of_py += fragment

fin.close()
print(zen_of_py)

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!



In [24]:
fin = open('tmp1', 'r')

while True:
    line = fin.readline()
    if not line:
        break
    print(line, end='')

fin.close()

First line
Second line
Third line
End
A new line after end


In [25]:
fin = open('tmp1', 'r')
for line in fin:
    print(line, end='')
fin.close()

First line
Second line
Third line
End
A new line after end


In [26]:
fin = open('tmp1', 'r')
lines = fin.readlines()
fin.close()

lines

['First line\n',
 'Second line\n',
 'Third line\n',
 'End\n',
 'A new line after end\n']

### Write Binary File

In [27]:
bdata = bytes(range(256))
len(bdata)

256

In [28]:
print(bdata[0])
print(bdata[255])

0
255


In [29]:
fout = open('bfile', 'wb')
ret = fout.write(bdata)
fout.close()
print('Write %d bytes' % ret)

Write 256 bytes


In [30]:
fout = open('bfile', 'wb')
size = len(bdata)
offset = 0
chunk = 100

while offset < size:
    ret = fout.write(bdata[offset : offset+chunk])
    print(ret)
    offset += chunk

fout.close()

100
100
56


### Read Binary File

In [31]:
fin = open('bfile', 'rb')
bdata = fin.read()
fin.close()
len(bdata)

256

### Close Files Automatically by Using `with`

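If you open a file and forget to close it, Python will close it for you once the file object is no longer referenced and gets garbage-collected. However, a safer way is to use the `with` statement, which closes the file as soon as the block ends, even if an exception is raised inside it.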

In [32]:
with open('tmp1', 'w') as fout:
    fout.write(text)
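
To see what `with` buys you, here is a small sketch that raises an error in the middle of a write. The file name `with_demo.txt` and the `RuntimeError` are invented for the demonstration; the `closed` attribute is a standard attribute of file objects.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'with_demo.txt')  # hypothetical file name

    try:
        with open(path, 'w') as fout:
            fout.write('partial data\n')
            raise RuntimeError('something went wrong mid-write')
    except RuntimeError:
        pass

    # The with block closed (and flushed) the file despite the exception
    closed_after_error = fout.closed

    with open(path) as fin:
        saved = fin.read()

print(closed_after_error)
print(repr(saved))
```

Without `with`, the same guarantee would need an explicit `try`/`finally` around every `open()`.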

### Change Position

As you read and write, Python keeps track of where you are in the file. `tell()` returns your current offset from the beginning of the file, in bytes. `seek()` lets you jump to another byte offset in the file.

In [33]:
fin = open('tmp1', 'r')
print('starting pos =', fin.tell())
line = fin.readline()
print('# char read =', len(line))
print('ending pos =', fin.tell())
fin.close()

starting pos = 0
# char read = 11
ending pos = 11


> The offset equals the number of characters read only for text files in an encoding such as **ASCII**, where each character is stored in a single byte. Encodings like **UTF-8** may use a varying number of bytes per character.
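
A minimal sketch of that mismatch, assuming a made-up sample line with two non-ASCII characters and a hypothetical file name `utf8_demo.txt`. The file is reopened in binary mode so that `tell()` reports a plain byte offset:

```python
import os
import tempfile

line = 'na\u00efve caf\u00e9\n'   # "naïve café" plus a newline: 11 characters

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'utf8_demo.txt')  # hypothetical file name

    # newline='\n' keeps the line ending as a single byte on every platform
    with open(path, 'w', encoding='utf-8', newline='\n') as fout:
        fout.write(line)

    with open(path, 'rb') as fin:
        raw = fin.readline()
        byte_offset = fin.tell()

chars_read = len(raw.decode('utf-8'))

print('# char read =', chars_read)
print('ending pos =', byte_offset)
```

The two accented letters take two bytes each in UTF-8, so the byte offset is larger than the character count.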

In [34]:
fin = open('bfile', 'rb')
fin.seek(255)
bdata = fin.read()

print(len(bdata))
print(bdata[0])

1
255


You can call `seek()` with a second argument: `seek(offset, origin)`.

- If `origin` is `0` (default), go `offset` bytes from the start.
- If `origin` is `1`, go `offset` bytes from the current position.
- If `origin` is `2`, go `offset` bytes relative to the end (so `offset` is normally zero or negative).

> `origin` must be passed positionally, not as a keyword argument. The Python documentation calls this parameter `whence`.

In [35]:
fin = open('bfile', 'rb')
fin.seek(-1, 2)
bdata = fin.read()

print(len(bdata))
print(bdata[0])

1
255
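
The three origins can be compared side by side. This sketch uses an in-memory `io.BytesIO` buffer as a stand-in for `bfile` (an assumption made so the example needs no file on disk) and the named constants `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END`, which equal `0`, `1`, and `2`:

```python
import io
import os

buf = io.BytesIO(bytes(range(256)))   # same content as bfile, held in memory

buf.seek(10, os.SEEK_SET)     # origin 0: 10 bytes from the start
first = buf.read(1)[0]        # reads byte value 10; position is now 11

buf.seek(5, os.SEEK_CUR)      # origin 1: 5 bytes forward from position 11
second = buf.read(1)[0]       # reads byte value 16

buf.seek(-2, os.SEEK_END)     # origin 2: 2 bytes before the end
tail = list(buf.read())       # the last two byte values

print(first, second, tail)
```

Because every byte in this buffer stores its own index, the values read back show exactly where each `seek()` landed.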


## Structured Text Files

With plain text files, the only level of organization is the *line* (separated by a newline character). Sometimes you may want a richer structure. One way to get it is by introducing extra separators.

- Comma-Separated Values (CSV) and similar delimited formats: `,`, `\t`, `|`
- HTML & XML: `<`, `>`
- JSON: `{`, `}`, `[`, `]`


### CSV

In [36]:
import csv

dc_heros = [
    ['Flash', 'Barry Allen'],
    ['Green Arrow', 'Oliver Queen'],
    ['Atom', 'Ray Palmer'],
    ['Bat Man', 'Bruce Wayne']
]

In [37]:
with open('dc_heros.csv', 'w', newline='') as fout:  # newline='' avoids blank rows on Windows
    csvout = csv.writer(fout)
    csvout.writerows(dc_heros)

In [38]:
with open('dc_heros.csv', 'r') as fin:
    csvin = csv.reader(fin)
    dc_heros = [row for row in csvin]

print(dc_heros)

[['Flash', 'Barry Allen'], ['Green Arrow', 'Oliver Queen'], ['Atom', 'Ray Palmer'], ['Bat Man', 'Bruce Wayne']]


In [42]:
with open('dc_heros.csv', 'r') as fin:
    csvin = csv.DictReader(fin, fieldnames=['Name', 'Real Name'])
    dc_heros = [row for row in csvin]

dc_heros

[OrderedDict([('Name', 'Flash'), ('Real Name', 'Barry Allen')]),
 OrderedDict([('Name', 'Green Arrow'), ('Real Name', 'Oliver Queen')]),
 OrderedDict([('Name', 'Atom'), ('Real Name', 'Ray Palmer')]),
 OrderedDict([('Name', 'Bat Man'), ('Real Name', 'Bruce Wayne')])]

In [43]:
dc_heros = [
    {'Name': 'Flash',       'Real Name': 'Barry Allen'},
    {'Name': 'Green Arrow', 'Real Name': 'Oliver Queen'},
    {'Name': 'Atom',        'Real Name': 'Ray Palmer'},
    {'Name': 'Bat Man',     'Real Name': 'Bruce Wayne'},
]

with open('dc_heros.csv', 'w', newline='') as fout:  # newline='' avoids blank rows on Windows
    cout = csv.DictWriter(fout, dc_heros[0].keys())
    cout.writeheader()
    cout.writerows(dc_heros)

In [44]:
with open('dc_heros.csv', 'r') as fin:
    csvin = csv.DictReader(fin)
    dc_heros = [row for row in csvin]

dc_heros

[OrderedDict([('Name', 'Flash'), ('Real Name', 'Barry Allen')]),
 OrderedDict([('Name', 'Green Arrow'), ('Real Name', 'Oliver Queen')]),
 OrderedDict([('Name', 'Atom'), ('Real Name', 'Ray Palmer')]),
 OrderedDict([('Name', 'Bat Man'), ('Real Name', 'Bruce Wayne')])]
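
The separator list at the start of this section also mentions `\t` and `|`. As a sketch of how the `csv` module handles those, the `delimiter` argument can be set on both the writer and the reader. The in-memory `io.StringIO` buffer and the shortened hero list are stand-ins chosen to keep the example self-contained:

```python
import csv
import io

rows = [['Flash', 'Barry Allen'], ['Atom', 'Ray Palmer']]

# Write pipe-separated values into an in-memory text buffer
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', lineterminator='\n')
writer.writerows(rows)
pipe_text = buf.getvalue()

# Read them back with the same delimiter
reader = csv.reader(io.StringIO(pipe_text), delimiter='|')
round_trip = list(reader)

print(pipe_text)
print(round_trip)
```

The reader has to be told the same delimiter as the writer; otherwise each line comes back as a single field.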

### JSON

In [45]:
temperature = {
    'avg': 20.0,
    'daily': [
        {'highest': 21.0, 'lowest': 19.0},
        {'highest': 22.0, 'lowest': 18.0},
        {'highest': 23.0, 'lowest': 17.0}
    ]
}

In [46]:
import json

temperature_json = json.dumps(temperature)
type(temperature_json)

str

In [47]:
print(temperature_json)

{"avg": 20.0, "daily": [{"highest": 21.0, "lowest": 19.0}, {"highest": 22.0, "lowest": 18.0}, {"highest": 23.0, "lowest": 17.0}]}


In [48]:
temperature2 = json.loads(temperature_json)

from pprint import pprint
pprint(temperature2)

{'avg': 20.0,
 'daily': [{'highest': 21.0, 'lowest': 19.0},
           {'highest': 22.0, 'lowest': 18.0},
           {'highest': 23.0, 'lowest': 17.0}]}
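
`dumps()` and `loads()` work on strings. To store JSON in a file, the `json` module also offers `dump()` and `load()`, which take a file object. A minimal sketch, assuming a hypothetical file name `temperature.json` and a shortened version of the data above:

```python
import json
import os
import tempfile

temperature = {
    'avg': 20.0,
    'daily': [{'highest': 21.0, 'lowest': 19.0}],
}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'temperature.json')  # hypothetical file name

    # dump() writes the JSON text straight to the open file
    with open(path, 'w') as fout:
        json.dump(temperature, fout, indent=2)

    # load() parses the file content back into Python objects
    with open(path) as fin:
        restored = json.load(fin)

print(restored == temperature)
```

The optional `indent` argument only changes how the text is laid out, not the data that is read back.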


You may get an exception when you try to encode custom objects (or even some standard library objects like `datetime`), because `json` only knows how to handle dicts, lists, strings, numbers, booleans, and `None`.

In [49]:
class Person():
    def __init__(self, name, gender):
        self.name = name
        self.gender = gender
    
    def say(self):
        print("Hello I'm %s, nice to meet you." % self.name)

In [50]:
p = Person('Edward', 'Male')
p_json = json.dumps(p)

TypeError: Object of type 'Person' is not JSON serializable

In [51]:
# If your class is simple enough
p_json = json.dumps(p.__dict__)
print(p_json)

{"name": "Edward", "gender": "Male"}


In [52]:
from json import JSONEncoder

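class PersonEncoder(JSONEncoder):
    def default(self, obj):
        # Handle Person ourselves; let the base class raise TypeError for anything else
        if isinstance(obj, Person):
            return {'Name': obj.name, 'Gender': obj.gender}
        return super().default(obj)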

In [53]:
p_json = json.dumps(p, cls=PersonEncoder)
print(p_json)

{"Name": "Edward", "Gender": "Male"}
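
The same problem shows up with `datetime`, as mentioned above. One possible approach, sketched here, is to pass a function through the `default` argument of `json.dumps()`. The helper name `encode_extra`, the sample event, and the choice of ISO 8601 text are assumptions for illustration, not the only way to do it:

```python
import json
from datetime import datetime

def encode_extra(obj):
    """Called by json.dumps() for objects it cannot serialize on its own."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError('Object of type %s is not JSON serializable' % type(obj).__name__)

# Sample data invented for this sketch
event = {'name': 'lecture', 'start': datetime(2017, 12, 18, 11, 47, 37)}

event_json = json.dumps(event, default=encode_extra)
print(event_json)

# JSON has no date type, so the value comes back as a plain string
restored = json.loads(event_json)
start = datetime.strptime(restored['start'], '%Y-%m-%dT%H:%M:%S')
print(start == event['start'])
```

Decoding is not automatic: the reader of the JSON has to know which strings are dates and convert them back explicitly.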


### Serialize by Using pickle

Saving data structures to a file is called **serializing**. Formats such as JSON might require some custom converters to serialize all the data types from a Python program. Python provides the `pickle` module to save and restore almost any Python object in a **special binary format**.

In [54]:
import pickle
p_pickle = pickle.dumps(p)
p_pickle

b'\x80\x03c__main__\nPerson\nq\x00)\x81q\x01}q\x02(X\x04\x00\x00\x00nameq\x03X\x06\x00\x00\x00Edwardq\x04X\x06\x00\x00\x00genderq\x05X\x04\x00\x00\x00Maleq\x06ub.'

In [55]:
p2 = pickle.loads(p_pickle)
p2.say()

Hello I'm Edward, nice to meet you.


In [56]:
with open('ed.pkl', 'wb') as fout:
    pickle.dump(p, fout)

> <span style="color:red">**Warning**</span>: The `pickle` module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

#### Pickle Protocols

Pickle provides several protocols to save data. The higher the protocol, the more recent the Python version needed to read it:

- 0: Original “human-readable” protocol, backwards compatible with earlier versions of Python.
- 1: Old binary format, also compatible with earlier versions of Python.
- 2: Provides much more efficient pickling of new-style classes. (>= Python 2.3)

Python 3.x only

- 3: Explicit support for bytes objects; cannot be unpickled by Python 2.x. This was the default in Python 3.0 through 3.7.
- 4: Adds support for very large objects, pickling more kinds of objects, and some data format optimizations. (>= Python 3.4; became the default in Python 3.8)
- 5: Adds support for out-of-band data buffers. (>= Python 3.8)

In [57]:
with open('ed_readable.pkl', 'wb') as fout:
    pickle.dump(p, fout, protocol=0)
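
Whatever protocol was used for writing, `pickle.load()` detects it automatically when reading. The sketch below round-trips a small record through a file and compares two protocols. The record, its field names, and the file name `record.pkl` are invented for the example; it uses built-in types instead of the `Person` class so that it runs on its own.

```python
import os
import pickle
import tempfile
from datetime import date

# Sample data containing types that JSON cannot handle directly
record = {'name': 'Edward', 'joined': date(2017, 12, 18), 'tags': {'a', 'b'}}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, 'record.pkl')  # hypothetical file name

    with open(path, 'wb') as fout:
        pickle.dump(record, fout, protocol=2)

    # No protocol argument is needed when loading
    with open(path, 'rb') as fin:
        restored = pickle.load(fin)

text_style = pickle.dumps(record, protocol=0)
binary_style = pickle.dumps(record, protocol=2)

print(restored == record)
print(binary_style[:2])   # protocol 2 and later start with a version marker
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

`pickle.DEFAULT_PROTOCOL` and `pickle.HIGHEST_PROTOCOL` depend on the Python version that runs the code, so their values are left for you to check.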

## Systems

Python provides many system functions through a module named `os`.

### Files

In [None]:
import os

os.path.exists('tmp1')

In [None]:
os.path.exists('../T8')

In [None]:
os.path.isfile('tmp1')

In [None]:
os.path.isfile('../T8')

In [None]:
os.path.isdir('../T8')

In [None]:
os.path.isabs('tmp1')

In [None]:
os.path.isabs('/tmp1')

> `isabs()` only looks at the form of the path. It won't check whether the path really exists.
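
Since the cells above depend on files from earlier steps, here is a self-contained sketch of the same checks. It creates its own temporary directory, and the file name `notes.txt` is made up:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    file_path = os.path.join(workdir, 'notes.txt')  # hypothetical file name
    with open(file_path, 'w') as f:
        f.write('hello\n')

    checks = {
        'file exists': os.path.exists(file_path),
        'file is a file': os.path.isfile(file_path),
        'dir is a file': os.path.isfile(workdir),
        'dir is a dir': os.path.isdir(workdir),
        'missing exists': os.path.exists(os.path.join(workdir, 'missing')),
    }

# isabs() inspects only the string, so these paths need not exist
relative_is_abs = os.path.isabs('notes.txt')
absolute_is_abs = os.path.isabs(os.path.abspath('notes.txt'))

for label, value in checks.items():
    print(label, '->', value)
print(relative_is_abs, absolute_is_abs)
```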

In [None]:
os.listdir('.')

In [None]:
os.rename('tmp2', 'tmp')
os.listdir('.')

In [None]:
os.rename('tmp1', 'tmp')
os.listdir('.')

> `rename()` to an existing file overwrites that file on Unix-like systems. On Windows it raises `FileExistsError` instead.

In [None]:
os.remove('bfile')
os.listdir('.')

In [None]:
import shutil

shutil.copy('tmp', 'tmp_copy')
os.listdir('.')

In [None]:
os.link('tmp', 'tmp_hard_link')
os.listdir('.')

In [None]:
print(os.path.isfile('tmp_hard_link'))
print(os.path.islink('tmp_hard_link'))

In [None]:
os.symlink('tmp', 'tmp_soft_link')
os.listdir('.')

In [None]:
print(os.path.isfile('tmp_soft_link'))
print(os.path.islink('tmp_soft_link'))

In [None]:
os.path.abspath('tmp_soft_link')

In [None]:
os.path.realpath('tmp_soft_link')

In [None]:
with open('tmp_soft_link', 'a') as f:
    f.write('append a new line')

In [None]:
with open('tmp', 'r') as f:
    print(f.read())

In [None]:
os.rename('tmp', 'tmp_alias')

In [None]:
os.path.realpath('tmp_soft_link')

In [None]:
with open('tmp_soft_link', 'r') as f:
    print(f.read())

In [None]:
with open('tmp_hard_link', 'r') as f:
    print(f.read())

In [None]:
os.listdir('.')

In [None]:
os.remove('tmp_alias')

In [None]:
with open('tmp_hard_link', 'r') as f:
    print(f.read())

**Copy**:

- You have two different versions of the file.
- If you edit one, the other one stays the same.
- If you delete one, the other one stays there, but it may not be identical if it was edited.
- Twice as much disk space used (two different files).

**Hard Link**:

- You have one file with two different filenames.
- If you edit one, it gets edited in all filename locations.
- If you delete one, it still exists in other places.
- Only one file on disk.

> You may think of a filename as a hard link.

**Soft Link**:

- You have one file with one filename and a pointer to that file with the other filename.
- If you edit the link, it's really editing the original file.
- If you delete the file, the link is broken.
- If you remove the link, the file stays in place.
- Only one file on disk.

<img src='link.png' align='center' width='600px'>
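
The three behaviors summarized above can be checked in one run. This is a sketch that assumes a Unix-like system (on Windows, creating symbolic links may require extra privileges); all file names are invented and live in a temporary directory.

```python
import os
import shutil
import tempfile

def read(path):
    with open(path) as f:
        return f.read()

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, 'original.txt')
    copy = os.path.join(workdir, 'copy.txt')
    hard = os.path.join(workdir, 'hard_link.txt')
    soft = os.path.join(workdir, 'soft_link.txt')

    with open(original, 'w') as f:
        f.write('v1\n')

    shutil.copy(original, copy)   # independent second file
    os.link(original, hard)       # second name for the same file
    os.symlink(original, soft)    # pointer to the first name

    # Edit the original after the copy and the links were made
    with open(original, 'a') as f:
        f.write('v2\n')

    copy_text = read(copy)        # the copy does not see the edit
    hard_text = read(hard)        # the hard link does
    soft_text = read(soft)        # so does the soft link
    name_count = os.stat(original).st_nlink   # names pointing at the file

    # Delete the original name
    os.remove(original)
    hard_survives = os.path.exists(hard)
    soft_broken = os.path.islink(soft) and not os.path.exists(soft)

print(repr(copy_text), repr(hard_text), repr(soft_text))
print(name_count, hard_survives, soft_broken)
```

After the original name is removed, the data is still reachable through the hard link, while the soft link points at a name that no longer exists.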

### Directories

In [None]:
os.mkdir('untitled')
os.listdir('.')

In [None]:
os.rmdir('untitled')
os.path.exists('untitled')

In [None]:
os.chdir('..')
print(os.listdir('.'))

### List Matching with `glob()`

The `glob()` function matches file or directory names by using Unix shell rules rather than the more complete regular expression syntax.

- `*` matches any sequence of characters (including none)
- `?` matches a single character
- `[abc]` matches a character `a`, `b` or `c`
- `[!abc]` matches any character except `a`, `b`, or `c`

In [None]:
import glob

glob.glob('T*')

In [None]:
glob.glob('???')

In [None]:
glob.glob('??3')

In [None]:
glob.glob('*[12]')  # names ending in 1 or 2; a comma or space inside [] would be matched literally
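
The results of the cells above depend on what is in the current directory. The following sketch applies the same rules to a known set of files. The file names and the helper `match()` are made up for the demonstration:

```python
import glob
import os
import tempfile

names = ['T1', 'T2', 'T3', 'abc', 'ab3', 'notes']   # hypothetical file names

with tempfile.TemporaryDirectory() as workdir:
    for name in names:
        open(os.path.join(workdir, name), 'w').close()

    def match(pattern):
        """Return the sorted base names in workdir that match pattern."""
        hits = glob.glob(os.path.join(workdir, pattern))
        return sorted(os.path.basename(hit) for hit in hits)

    starts_with_t = match('T*')        # T followed by anything
    three_chars = match('???')         # exactly three characters
    ends_with_3 = match('??3')         # two characters, then 3
    ends_with_1_or_2 = match('*[12]')  # anything ending in 1 or 2
    not_t = match('[!T]*')             # first character is not T

print(starts_with_t)
print(three_chars)
print(ends_with_3)
print(ends_with_1_or_2)
print(not_t)
```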

### Programs and Processes

When you run an individual program, your operating system creates a single **process**. It uses system resources (CPU, memory, disk space) and data structures in the operating system's kernel.

A process is isolated from other processes.

In [58]:
import os

print(os.getpid())
print(os.getcwd())

50396
/Users/joshuaz/Documents/Data Application Lab/教学/Beijing/万门/教学方案/计算机语言python/1218 进阶python/code
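
One way to observe that each program gets its own process is to start a second Python interpreter and compare process IDs. This is only a sketch; it uses `sys.executable` so that the child runs under the same interpreter as the parent.

```python
import os
import subprocess
import sys

parent_pid = os.getpid()

# Start a separate Python process that prints its own process ID
result = subprocess.run(
    [sys.executable, '-c', 'import os; print(os.getpid())'],
    stdout=subprocess.PIPE,
    universal_newlines=True,   # decode the output as text
    check=True,                # raise if the child exits with an error
)
child_pid = int(result.stdout)

print('parent:', parent_pid)
print('child: ', child_pid)
print(parent_pid != child_pid)
```

The child's output reaches the parent only through the pipe that `subprocess.run()` sets up; the two processes share no variables.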


In [65]:
import subprocess

ret = subprocess.getoutput('date')
ret

'Mon Dec 18 11:47:37 CST 2017'

In [64]:
ret = subprocess.call('date')
ret

0

In [None]:
subprocess.call('say "When you run an individual program, your operating system creates a single process."', shell=True)

In [71]:
text = 'When you run an individual program, your operating system creates a single process.'
subprocess.call(['say', '-r', '300', text])

0

In [None]:
# Windows only; requires pywin32: https://sourceforge.net/projects/pywin32/
import win32com.client
spk = win32com.client.Dispatch("SAPI.SpVoice")
spk.Speak(u"Put the text to speak here")

In [68]:
ret = subprocess.getoutput('dir')
ret

'/bin/sh: dir: command not found'

In [69]:
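# Windows: `dir` is a shell built-in there. On the macOS machine used here the shell returns 127 (command not found)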
ret = subprocess.call(['dir'],shell=True)
ret
#subprocess.call(['abc'],shell=True)

127

## Concurrency

So far, most of the programs that we've written run in one place (one machine) and one line at a time (sequential). But we can do more than one thing at a time (concurrency) and in more than one place (distributed computing or networking).

In [None]:
import threading

def do_this(what):
    whoami(what)

def whoami(what):
    print('Thread %s says: %s' % (threading.current_thread(), what))

if __name__ == '__main__':
    whoami("I'm the main program")
    for n in range(4):
        p = threading.Thread(target=do_this, args=("I'm function %s" % n,))
        p.start()
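
The program above starts its threads but never waits for them. As a sketch of a common next step, `join()` blocks until a thread has finished, and a `Lock` protects data that several threads modify. The function `work` and the squaring task are invented for the example:

```python
import threading

results = []
lock = threading.Lock()

def work(n):
    """Compute a value and record it in the shared list."""
    square = n * n
    with lock:                       # only one thread appends at a time
        results.append((n, square))

threads = [threading.Thread(target=work, args=(n,)) for n in range(4)]

for t in threads:
    t.start()
for t in threads:
    t.join()                         # wait until every thread is done

# Threads may finish in any order, so sort before printing
print(sorted(results))
```

Sorting is needed because the order in which threads append to the list is not guaranteed.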