# Think Python 

## Chapter 14 - Files

*HTML of this chapter in "Think Python 2e" can be found [here](http://greenteapress.com/thinkpython2/html/thinkpython2015.html "Chapter 14").*



### 14.12  Exercises

#### Exercise 1  

*Write a function called `sed` that takes as arguments a pattern string, a replacement string, and two filenames; it should read the first file and write the contents into the second file (creating it if necessary). If the pattern string appears anywhere in the file, it should be replaced with the replacement string.*

*If an error occurs while opening, reading, writing or closing files, your program should catch the exception, print an error message, and exit.*

In [2]:
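def sed(filename1, filename2, pattern, replacement):
    """
    Copies filename1 to filename2, replacing every
    occurrence of pattern with replacement.
    """

    try:
        fin = open(filename1, 'r')
        fout = open(filename2, 'w')

        for line in fin:
            fout.write(line.replace(pattern, replacement))

        # Close both files, not just the output file
        fin.close()
        fout.close()

    # File errors (missing file, no permission, etc.) are all OSError
    except OSError:
        print("That didn't go as planned...")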

*__Text of `'test1.txt'`:__*

> *Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Sed libero enim sed faucibus turpis in eu. Viverra ipsum nunc aliquet bibendum enim facilisis gravida neque convallis. Lectus sit amet est placerat in egestas erat imperdiet sed. Varius morbi enim nunc faucibus a pellentesque sit. Imperdiet sed euismod nisi porta. Phasellus vestibulum lorem sed risus ultricies tristique nulla aliquet enim. At risus viverra adipiscing at in tellus integer feugiat. Pellentesque dignissim enim sit amet. Fringilla phasellus faucibus scelerisque eleifend donec pretium vulputate. Consectetur adipiscing elit duis tristique sollicitudin nibh sit. Tincidunt lobortis feugiat vivamus at augue eget. Cursus vitae congue mauris rhoncus aenean vel.*

In [3]:
sed('test1.txt', 'test2.txt', 'et', 'zzzzz')

*__Text of `'test2.txt'` after running `sed`:__*

>*Lorem ipsum dolor sit amzzzzz, consectzzzzzur adipiscing elit, sed do eiusmod tempor incididunt ut labore zzzzz dolore magna aliqua. Sed libero enim sed faucibus turpis in eu. Viverra ipsum nunc aliquzzzzz bibendum enim facilisis gravida neque convallis. Lectus sit amzzzzz est placerat in egestas erat imperdizzzzz sed. Varius morbi enim nunc faucibus a pellentesque sit. Imperdizzzzz sed euismod nisi porta. Phasellus vestibulum lorem sed risus ultricies tristique nulla aliquzzzzz enim. At risus viverra adipiscing at in tellus integer feugiat. Pellentesque dignissim enim sit amzzzzz. Fringilla phasellus faucibus scelerisque eleifend donec przzzzzium vulputate. Consectzzzzzur adipiscing elit duis tristique sollicitudin nibh sit. Tincidunt lobortis feugiat vivamus at augue egzzzzz. Cursus vitae congue mauris rhoncus aenean vel.*

In [4]:
# 'test3.txt' does not exist

sed('test3.txt', 'test2.txt', 'et', 'rocking')

That didn't go as planned...
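
*__A possible variation, as a rough sketch only: `with` blocks close both files even if an error occurs partway through. The name `sed_with`, the book's argument order (pattern and replacement first), and the `True`/`False` return value are my own assumptions, not part of the exercise:__*

```python
def sed_with(pattern, replacement, source, dest):
    """
    Copies source to dest, replacing pattern with replacement.
    Returns True on success and False if a file error occurred.
    """
    try:
        # Both files are closed automatically when the block ends,
        # whether or not an exception was raised inside it.
        with open(source, 'r') as fin, open(dest, 'w') as fout:
            for line in fin:
                fout.write(line.replace(pattern, replacement))
    except OSError as err:
        print("That didn't go as planned:", err)
        return False
    return True
```

*__Because `source` is opened first, a missing input file raises the error before `dest` is created or overwritten.__*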


#### Exercise 2  

*Write a module that imports `anagram_sets` and provides two new functions: `store_anagrams` should store the anagram dictionary in a “shelf”; `read_anagrams` should look up a word and return a list of its anagrams.* 

*__I'm using Jupyter notebooks, so this will look a little different from what someone running Python from the console would write:__*

In [5]:
# %load anagram_sets.py 
def sort_letters(string):
    """
    Returns the letters in string as a new string
    whose letters are in alphabetical order.
    """
        
    return ''.join(sorted(list(string.lower())))

def find_anagrams(text):
    """
    Takes an iterable of words (one per line) and returns
    a dictionary that maps each sorted string of letters
    to the list of words made from exactly those letters.
    """

    sorted_dict = {}

    for line in text:
        orig_word = line.strip()
        sorted_word = sort_letters(orig_word)
        sorted_dict.setdefault(sorted_word, []).append(orig_word)
            
    return sorted_dict
    

In [6]:
%run anagram_sets.py

In [7]:
path_to_words = "E:\\words.txt"

*__`.write()` only accepts strings, so it won't work with dictionaries, and `pickle` is very slow, so I use `shelve` instead. `shelve` is only mentioned, not covered, in "Think Python 2e":__*

In [8]:
import shelve

def store_anagrams(path, dict_name):
    """
    Creates a dictionary of anagrams and stores
    it.
    
    Arguments:
    path: Path where word list can be found.
    dict_name: Name to be given to the 
        resulting dictionary.
    """
    
    # Close the word list as soon as it has been read
    with open(path) as fin:
        anagrams = find_anagrams(fin)

    ad = shelve.open(dict_name)
    for k, v in anagrams.items():
        ad[k] = v
    ad.close()
    


In [9]:
# `path_to_words` should be the local path to `words.txt`

store_anagrams(path_to_words, 'anagram_dict.db')

In [10]:
def read_anagrams(word, d):
    """
    Retrieves the anagrams for a word from an
    existing anagrams dictionary.
    
    Arguments:
    word: The word to be queried.
    d: The anagrams dictionary.  Must be in the 
        working directory.
    """
    
    sorted_word = sort_letters(word)
    s = shelve.open(d)

    try:
        print(s[sorted_word])
    # A missing key means no stored word uses these letters
    except KeyError:
        print("'{}' was not found in '{}'.".format(word, d))
    finally:
        s.close()
    


In [11]:
read_anagrams('post', 'anagram_dict.db')

['opts', 'post', 'pots', 'spot', 'stop', 'tops']


In [12]:
read_anagrams('least', 'anagram_dict.db')

['least', 'setal', 'slate', 'stale', 'steal', 'stela', 'taels', 'tales', 'teals', 'tesla']


In [13]:
read_anagrams('ohisashiburi', 'anagram_dict.db')

'ohisashiburi' was not found in 'anagram_dict.db'.
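
*__Since the book doesn't cover `shelve`, here is a minimal sketch of the idea: a shelf behaves much like a dictionary that lives on disk. The temporary directory and the name `demo_shelf` are placeholders of mine, not part of the exercise:__*

```python
import os
import shelve
import tempfile

# A throwaway location for the demo shelf (hypothetical name)
shelf_path = os.path.join(tempfile.mkdtemp(), 'demo_shelf')

# Write: keys must be strings; values are pickled behind the scenes
with shelve.open(shelf_path) as shelf:
    shelf['opst'] = ['opts', 'post', 'pots', 'spot', 'stop', 'tops']

# Read it back in a separate session
with shelve.open(shelf_path) as shelf:
    anagrams = shelf['opst']
    missing = shelf.get('xyz', [])   # .get() avoids a KeyError

print(anagrams)   # ['opts', 'post', 'pots', 'spot', 'stop', 'tops']
```

*__Using `with` closes the shelf automatically, which matters because changes may not reach the disk until the shelf is closed.__*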


#### Exercise 3  

*In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different file names. The goal of this exercise is to search for duplicates.*

<ol>
    <li><i>Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.</i></li>
    <li><i>To recognize duplicates, you can use md5sum to compute a “checksum” for each file. If two files have the same checksum, they probably have the same contents.</i></li>
    <li><i>To double-check, you can use the Unix command diff.</i></li>
</ol>
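
*__The first step could also be sketched with `os.walk` from the standard library, which handles the recursion itself. This is only an illustration; `find_by_suffix` is a name I made up:__*

```python
import os

def find_by_suffix(dirname, suffix):
    """
    Returns a list of complete paths for all files under
    dirname (including subdirectories) that end with suffix.
    """
    matches = []
    # os.walk yields one (directory, subdirectories, files) tuple
    # for dirname and for every directory below it.
    for root, dirs, files in os.walk(dirname):
        for name in files:
            if name.endswith(suffix):
                matches.append(os.path.join(root, name))
    return matches
```

*__Note that `endswith` is case-sensitive, so `'.mp3'` will not match a file named `song.MP3`.__*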

*__I had a number of difficulties with this problem, and for the wrong reasons. The first was `popen`, which wasn't working correctly on my system (perhaps because it was deprecated), so I wasn't able to use `md5sum` the way shown in the book. I ended up using the `hashlib` library instead, which wasn't difficult to use. Since MD5 has known vulnerabilities, I also decided to use SHA-256, which worked just as well for my purposes. The other difficulty was that I'm not on a UNIX system, so instead of the UNIX command `diff`, I used `cmp` from the `filecmp` library, which - as far as I can tell - works just as well for a problem like this.__*

*__In the end, I looked at the author's solution for help with this problem, which I generally have not done while working my way through "Think Python 2e". My code is clearly influenced by the author's, but beyond the differences noted above, I made one more key addition: the author's code uses the UNIX command `diff` to compare pairs of files, but a directory may contain more than two identical files. My function `find_duplicates` can check any number of files for identical contents.__*

*__One final note: this code is not particularly fast, and might lag with large folders.__*
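
*__One thing that might help with very large files is hashing in fixed-size chunks, so the whole file never has to sit in memory at once. This is only a sketch: `sha256_chunked` and the 64 KiB chunk size are my own assumptions, and I haven't measured whether it is actually faster:__*

```python
import hashlib

def sha256_chunked(path, chunk_size=65536):
    """
    Returns the SHA-256 hex digest of the file at path,
    reading it chunk_size bytes at a time.
    """
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # f.read() returns b'' at end of file, which stops the loop
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

*__Feeding the chunks to `update` one after another gives the same digest as hashing the whole file in a single call.__*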

In [14]:
import os, hashlib, filecmp

def make_sha256(path):
    """
    Returns the SHA-256 hex digest (a string) of the
    file at designated path.
    """
    # `with` closes the file once its contents have been hashed
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def walk(dirname):
    """
    Nearly identical to the code presented in 
    the solutions for 'Think Python 2e', chapter 14
    """

    names = []

    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        
        if os.path.isfile(path):
            names.append(path)
        elif os.path.isdir(path):
            names.extend(walk(path))
            
    return names

def make_dict_for_suffix(dirname, suffix):
    """
    Returns a dictionary of checksums for all files 
    in dirname with designated suffix.
    """
    
    files = walk(dirname)
    
    d = {}
    for file in files:
        if file.endswith(suffix):
            sha256 = make_sha256(file)
            
            if sha256 in d:
                d[sha256].append(file)
            else:
                d[sha256] = [file]
                
    return d
                
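def confirm_same(file1, file2):
    """
    Returns True if two files have same contents.
    """
    # shallow=False forces a comparison of the contents; the default
    # (shallow=True) accepts files whose os.stat() signatures match.
    return filecmp.cmp(file1, file2, shallow=False)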
    
def find_duplicates(d):
    """
    Prints file paths of files in map d that have
    same checksums.  Confirms if files have identical 
    contents. Can handle any number of identical files.
    """
    
    
    for k, v in d.items():
        if len(v) > 1:
            print('The following files have identical checksums:\n')
            for f in v:
                print('\t{}\n'.format(f))

            # Compare every pair of files once (v is the same list as d[k])
            for i in range(len(v)):
                for j in range(i + 1, len(v)):
                    if confirm_same(v[i], v[j]):
                        print('\nThe following two files have identical contents: \n\n \t{} \n\n and \n\n \t{} \n'.format(v[i], v[j]))
                    else:
                        print('\nThe following two files do NOT have identical contents: \n\n \t{} \n\n and \n\n \t{} \n'.format(v[i], v[j]))
                

            


In [15]:
path = "E:\\Chapter_14_Music"

d = make_dict_for_suffix(path, suffix = '.mp3')

In [16]:
find_duplicates(d)

The following files have identical checksums:

	E:\Chapter_14_Music\duplicate1\02 Janglin.mp3

	E:\Chapter_14_Music\duplicate2\02 Janglin.mp3

	E:\Chapter_14_Music\Edward Sharpe & The Magnetic Zeros\Up From Below\02 Janglin.mp3


The following two files have identical contents: 

 	E:\Chapter_14_Music\duplicate1\02 Janglin.mp3 

 and 

 	E:\Chapter_14_Music\duplicate2\02 Janglin.mp3 


The following two files have identical contents: 

 	E:\Chapter_14_Music\duplicate1\02 Janglin.mp3 

 and 

 	E:\Chapter_14_Music\Edward Sharpe & The Magnetic Zeros\Up From Below\02 Janglin.mp3 


The following two files have identical contents: 

 	E:\Chapter_14_Music\duplicate2\02 Janglin.mp3 

 and 

 	E:\Chapter_14_Music\Edward Sharpe & The Magnetic Zeros\Up From Below\02 Janglin.mp3 

