# Think Python

## Chapter 14 - Files

### 14.1 Persistence

*HTML of this chapter in "Think Python 2e" can be found [here](http://greenteapress.com/thinkpython2/html/thinkpython2015.html "Chapter 14").*

### 14.2 Reading and writing




In [1]:
import os

path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14"
os.chdir(path)

In [2]:
# writing a file
fout = open('output.txt', 'w')

In [3]:
# writing to the file
line1 = "This here's the wattle,\n"
fout.write(line1)

24

In [4]:
# adding new data
line2 = "the emblem of our land.\n"
fout.write(line2)

24

In [5]:
fout.close()
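
*The file written above can be read back in much the same way. A minimal sketch, assuming a throwaway temporary directory so that it does not touch `output.txt`; the `with` statement closes each file automatically, even if an error occurs:*

```python
import os
import tempfile

lines = ["This here's the wattle,\n", "the emblem of our land.\n"]

# Work in a temporary directory (an assumption made for this sketch).
with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'output.txt')

    # 'w' creates the file, or truncates it if it already exists.
    with open(filename, 'w') as fout:
        for line in lines:
            fout.write(line)

    # Iterating over a file object yields one line at a time.
    with open(filename) as fin:
        read_back = [line for line in fin]

print(read_back == lines)
```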

### 14.3 Format operator



In [6]:
camels = 42
'%d' % camels

'42'
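
*The format operator can also sit inside a longer string, and it takes a tuple when there is more than one format sequence. A short sketch (the variable names `single` and `multiple` are only for illustration):*

```python
camels = 42

# %d formats an integer, %g a floating-point number, %s any value as a string.
single = 'I have spotted %d camels.' % camels
multiple = 'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')

print(single)    # I have spotted 42 camels.
print(multiple)  # In 3 years I have spotted 0.1 camels.
```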

### 14.4 Filenames and paths

In [7]:
# author's code

import os

def walk(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        
        if os.path.isfile(path):
            print(path)
        else:
            walk(path)

In [8]:
mpath = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython"
walk(mpath)

C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\alice.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\alice_new.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\austen.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\beowulf.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\canterbury.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter12\pigwar.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter12\Think Python - Chapter 12 for GH.ipynb
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter12\Think Python - Chapter 12.ipynb
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter13\Think Python - Chapter 13.ipynb
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter14\anagrams.txt
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter14\anagrams_dict.db.bak
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter14\anagrams_dict.db.dat
C:\Users\mjcor\Desktop\ProgrammingStuff\ThinkPython\Chapter14\anagrams_dict

### 14.5 Catching exceptions

*__No notes.__*

### 14.6 Databases



In [9]:
import dbm
db = dbm.open('captions', 'c')

In [10]:
# creating new item

db['cleese.png'] = 'Photo of John Cleese'

In [11]:
db['cleese.png']

b'Photo of John Cleese'

In [12]:
db['cleese.png'] = 'Photo of John Cleese doing a silly walk.'

In [13]:
db['cleese.png']

b'Photo of John Cleese doing a silly walk.'

In [14]:
for key in db:
    print(key, db[key])

b'cleese.png' b'Photo of John Cleese doing a silly walk.'


In [15]:
db.close()
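
*As the outputs above show, keys and values come back as `bytes`. A minimal sketch of decoding them for display, assuming the pure-Python `dbm.dumb` backend and a temporary directory (both chosen only so the example behaves the same everywhere):*

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # dbm.dumb is the pure-Python backend; choosing it explicitly is an
    # assumption made so the sketch behaves the same on every platform.
    with dbm.dumb.open(os.path.join(tmpdir, 'captions'), 'c') as db:
        db['cleese.png'] = 'Photo of John Cleese'

        # Keys and values come back as bytes, so decode them for display.
        captions = {key.decode('utf-8'): db[key].decode('utf-8')
                    for key in db.keys()}

print(captions)
```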

### 14.7 Pickling

*`pickle.dumps` takes an object as a parameter and returns a serialized representation (dumps is short for “dump string”). In Python 3 the result is a `bytes` object, as the `b'...'` output shows:*

In [16]:
import pickle
t = [1, 2, 3]
pickle.dumps(t)

b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

*`pickle.loads` (“load string”) reconstitutes the object:*

In [17]:
s = pickle.dumps(t)
t2 = pickle.loads(s)
t2

[1, 2, 3]
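
*Pickling and then unpickling works like copying the object: the result has the same value but is a different object. A short sketch of that check:*

```python
import pickle

t1 = [1, 2, 3]
s = pickle.dumps(t1)   # a bytes object in Python 3
t2 = pickle.loads(s)

print(type(s))    # <class 'bytes'>
print(t1 == t2)   # True: same value
print(t1 is t2)   # False: different objects
```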

### 14.8 Pipes

*You can run shell commands by using a __pipe object__:*

In [18]:
# commands are different on Windows: `dir` stands in for the book's `ls -l`

import os

cmd = 'dir'
fp = os.popen(cmd)
res = fp.read()

*__Closing the pipe:__*

In [19]:
stat = fp.close()
print(stat)

None
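
*`os.popen` works for quick experiments like the one above; the `subprocess` module offers the same capability with more control. A minimal sketch, assuming the child command is the Python interpreter itself (so that it exists on every platform):*

```python
import subprocess
import sys

# Run a child Python process; sys.executable is used (as an assumption of
# this sketch) so the command exists on every platform.
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a child process")'],
    capture_output=True,  # collect stdout and stderr
    text=True,            # decode output as str rather than bytes
)

print(result.returncode)
print(result.stdout.strip())
```

*A return code of 0 conventionally means the command finished without errors.*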


*__Using pipe to run `md5sum` from Python and getting the result:__*

In [20]:
filename = 'book.tex'
cmd = 'md5sum ' + filename
fp = os.popen(cmd)
res = fp.read()
stat = fp.close()
print(res)
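
*`md5sum` is a Unix command, so the cell above may print nothing on Windows. A portable sketch of the same idea using `hashlib`; the helper name `md5_of_file` and the sample file are hypothetical, not part of the book:*

```python
import hashlib
import os
import tempfile

def md5_of_file(filename, chunk_size=65536):
    """Return the MD5 hex digest of a file's contents (hypothetical helper)."""
    digest = hashlib.md5()
    with open(filename, 'rb') as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a small temporary file rather than 'book.tex'.
with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = os.path.join(tmpdir, 'sample.txt')
    with open(sample_path, 'wb') as f:
        f.write(b'hello')
    checksum = md5_of_file(sample_path)

print(checksum)
```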





### 14.12  Exercises

#### Exercise 1  

*Write a function called `sed` that takes as arguments a pattern string, a replacement string, and two filenames; it should read the first file and write the contents into the second file (creating it if necessary). If the pattern string appears anywhere in the file, it should be replaced with the replacement string.*

*If an error occurs while opening, reading, writing or closing files, your program should catch the exception, print an error message, and exit.*

In [21]:
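def sed(filename1, filename2, pattern, replacement):
    """
    Copies filename1 to filename2, replacing every
    occurrence of pattern with replacement.
    """
    
    try:
        # `with` closes both files, even if an error occurs
        with open(filename1, 'r') as fin, open(filename2, 'w') as fout:
            for line in fin:
                fout.write(line.replace(pattern, replacement))
        
    except OSError:
        print("That didn't go as planned...")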

*__Text of `'test1.txt'`:__*

> *Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sed libero enim sed faucibus turpis in eu. Viverra ipsum nunc aliquet bibendum enim facilisis gravida neque convallis. Lectus sit amet est placerat in egestas erat imperdiet sed. Varius morbi enim nunc faucibus a pellentesque sit. Imperdiet sed euismod nisi porta. Phasellus vestibulum lorem sed risus ultricies tristique nulla aliquet enim. At risus viverra adipiscing at in tellus integer feugiat. Pellentesque dignissim enim sit amet. Fringilla phasellus faucibus scelerisque eleifend donec pretium vulputate. Consectetur adipiscing elit duis tristique sollicitudin nibh sit. Tincidunt lobortis feugiat vivamus at augue eget. Cursus vitae congue mauris rhoncus aenean vel.*

In [22]:
sed('test1.txt', 'test2.txt', 'et', 'zzzzz')

*__Text of `'test2.txt'` after running `sed`:__*

>*Lorem ipsum dolor sit amzzzzz, consectzzzzzur adipiscing elit, sed do eiusmod tempor incididunt ut labore zzzzz dolore magna aliqua. Sed libero enim sed faucibus turpis in eu. Viverra ipsum nunc aliquzzzzz bibendum enim facilisis gravida neque convallis. Lectus sit amzzzzz est placerat in egestas erat imperdizzzzz sed. Varius morbi enim nunc faucibus a pellentesque sit. Imperdizzzzz sed euismod nisi porta. Phasellus vestibulum lorem sed risus ultricies tristique nulla aliquzzzzz enim. At risus viverra adipiscing at in tellus integer feugiat. Pellentesque dignissim enim sit amzzzzz. Fringilla phasellus faucibus scelerisque eleifend donec przzzzzium vulputate. Consectzzzzzur adipiscing elit duis tristique sollicitudin nibh sit. Tincidunt lobortis feugiat vivamus at augue egzzzzz. Cursus vitae congue mauris rhoncus aenean vel.*

In [23]:
# 'test3.txt' does not exist

sed('test3.txt', 'test2.txt', 'et', 'rocking')

That didn't go as planned...


#### Exercise 2  

*Write a module that imports `anagram_sets` and provides two new functions: `store_anagrams` should store the anagram dictionary in a “shelf”; `read_anagrams` should look up a word and return a list of its anagrams.* 

*__I'm using Jupyter notebooks, so this will look a little different from running Python at the console:__*

In [24]:
# %load anagram_sets.py 
def sort_letters(string):
    """
    Returns the letters in string as a new string
    whose letters are in alphabetical order.
    """
        
    return ''.join(sorted(list(string.lower())))

def find_anagrams(text):
    """
    Takes an iterable of lines (one word per line) and returns
    a dictionary that maps each sorted string of letters to the
    list of words that can be spelled with those letters.
    """

    sorted_dict = {}

    for line in text:
        orig_word = line.strip()
        sorted_word = sort_letters(orig_word)
        sorted_dict.setdefault(sorted_word, []).append(orig_word)
            
    return sorted_dict
    

In [25]:
%run anagram_sets.py

In [26]:
path_to_words = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\words.txt"

*__`.write()` only accepts strings, not dictionaries, and pickling the whole dictionary into one file means loading all of it to look up a single word. `shelve`, which "Think Python 2e" mentions but does not cover, stores each key separately:__*

In [27]:
import shelve

def store_anagrams(path, dict_name):
    """
    Creates a dictionary of anagrams and stores
    it.
    
    Arguments:
    path: Path where word list can be found.
    dict_name: Name to be given to the 
        resulting dictionary.
    """
    
    # `with` closes the word list once it has been read
    with open(path) as fin:
        anagrams = find_anagrams(fin)

    with shelve.open(dict_name) as ad:
        for k, v in anagrams.items():
            ad[k] = v
    


In [28]:
# `path_to_words` should be the local path to `words.txt`

store_anagrams(path_to_words, 'anagram_dict.db')

In [29]:
def read_anagrams(word, d):
    """
    Retrieves the anagrams for a word from an
    existing anagrams dictionary.
    
    Arguments:
    word: The word to be queried.
    d: The anagrams dictionary.  Must be in the 
        working directory.
    """
    
    sorted_word = sort_letters(word)

    try:
        # `with` closes the shelf even when the key is missing
        with shelve.open(d) as s:
            print(s[sorted_word])
    except KeyError:
        print("'{}' was not found in '{}'.".format(word, d))
    


In [30]:
read_anagrams('post', 'anagram_dict.db')

['opts', 'post', 'pots', 'spot', 'stop', 'tops']


In [31]:
read_anagrams('least', 'anagram_dict.db')

['least', 'setal', 'slate', 'stale', 'steal', 'stela', 'taels', 'tales', 'teals', 'tesla']


In [32]:
read_anagrams('ohisashiburi', 'anagram_dict.db')

'ohisashiburi' was not found in 'anagram_dict.db'.


#### Exercise 3  

*In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different file names. The goal of this exercise is to search for duplicates.*

<ol>
    <li><i>Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.</i></li>
    <li><i>To recognize duplicates, you can use md5sum to compute a “checksum” for each file. If two files have the same checksum, they probably have the same contents.</i></li>
    <li><i>To double-check, you can use the Unix command diff.</i></li>
</ol>
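
*The first two steps above can be sketched as follows. This is one possible approach, not the book's solution: the function names `find_files` and `find_duplicates` and the demonstration files are hypothetical, and `hashlib` stands in for the `md5sum` command so the sketch also runs on Windows.*

```python
import hashlib
import os
import tempfile

def find_files(dirname, suffix):
    """Return full paths of all files under dirname ending with suffix."""
    paths = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            if name.endswith(suffix):
                paths.append(os.path.join(root, name))
    return paths

def find_duplicates(dirname, suffix):
    """Group files by the MD5 checksum of their contents."""
    by_checksum = {}
    for path in find_files(dirname, suffix):
        with open(path, 'rb') as f:
            checksum = hashlib.md5(f.read()).hexdigest()
        by_checksum.setdefault(checksum, []).append(path)
    # Keep only checksums shared by more than one file.
    return [sorted(paths) for paths in by_checksum.values() if len(paths) > 1]

# Small demonstration tree: two copies of one "song" and one different file.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    samples = [('a.mp3', b'song'),
               (os.path.join('sub', 'b.mp3'), b'song'),
               ('c.mp3', b'other')]
    for rel, data in samples:
        with open(os.path.join(top, rel), 'wb') as f:
            f.write(data)
    groups = find_duplicates(top, '.mp3')
    names = [[os.path.relpath(p, top) for p in g] for g in groups]

print(names)
```

*For the third step, a byte-for-byte comparison such as `filecmp.cmp(a, b, shallow=False)` could stand in for `diff`.*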

In [33]:
import hashlib

In [34]:
hashlib.md5("filename.exe".encode('utf-8')).hexdigest()

'2a53375ff139d9837e93a38a279d63e5'

In [54]:
path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder1"
os.chdir(path)

In [55]:
# author's code

for name in os.listdir(path):
        newpath = os.path.join(path, name)
        
        if os.path.isfile(newpath):
            print(hashlib.md5(name.encode('utf-8')).hexdigest())


f1e7d47868050f37f328e856a2941eb1


In [56]:
path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder2"
os.chdir(path)

In [57]:
for name in os.listdir(path):
        newpath = os.path.join(path, name)
        
        if os.path.isfile(newpath):
            print(hashlib.md5(name.encode('utf-8')).hexdigest())

f1e7d47868050f37f328e856a2941eb1


In [69]:
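import hashlib

# maps each checksum to the list of paths with that checksum;
# kept at module level so later cells can inspect it
d = {}

def walk_and_hash(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        
        if os.path.isfile(path):
            # hash the file's contents rather than its name
            with open(path, 'rb') as f:
                hsh = hashlib.md5(f.read()).hexdigest()
            d.setdefault(hsh, []).append(path)
        else:
            walk_and_hash(path)


def find_dup_files(dirname):
    d.clear()  # start fresh so repeated runs don't add the same paths again
    walk_and_hash(dirname)
        
    for v in d.values():
        
        if len(v) > 1:
            print("{} are duplicates".format(v))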


In [70]:
my_path = "C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme"

find_dup_files(my_path)

In [71]:
for v in d.values():
        if len(v) > 1:
            print("{} are duplicates".format(v))

['C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder1\\sumtest1a.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder1\\sumtest1a.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder1\\sumtest1a.txt'] are duplicates
['C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder2\\sumtest1.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\sumtest1.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder2\\sumtest1.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\sumtest1.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\folder2\\sumtest1.txt', 'C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\testme\\sumtest1.txt'] are duplicates
['C:\\Users\\mjcor\\Desktop\\ProgrammingStuff\\ThinkPython\\Chapter14\\tes