## More File Processing

So far we've learned about looping, but what is it really good for?  Today we'll see our first practical application of a for loop!  We will read a file into memory, treat each of its lines as a string, and apply various processing techniques to them.  We'll first go over reading text into memory, and then we'll go over some string processing techniques.




In [2]:
with open("alice_in_wonderland.txt","r") as f:
    text = f.read()

The above code reads our text into memory for processing.  We are now ready to start processing and analyzing our piece of text!
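Since the plan is to treat each line of the file as a string, here is a minimal sketch of looping over a file line by line. It writes a small stand-in file first (the name `sample.txt` and its contents are made up for illustration) so that it runs without `alice_in_wonderland.txt`.

```python
import os
import tempfile

# Write a tiny stand-in file so the example does not depend on the book
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# A file object can be looped over directly; each item is one line as a string
line_lengths = []
with open(sample_path, "r") as f:
    for line in f:
        # Each line keeps its trailing newline, so remove it before measuring
        line_lengths.append(len(line.rstrip("\n")))

print(line_lengths)  # [10, 11, 10]
```

Reading with `f.read()` and splitting afterwards, as we do in the rest of this lesson, gives the same lines; looping over the file is just another way to get at them.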

There are lots of methods built into the Python language for processing strings:

In [1]:
[elem for elem in dir(str()) if "__" not in elem]

['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

Today we will be going over:

* lower
* upper
* islower
* isnumeric
* isupper
* split
* splitlines
* translate
* strip
* lstrip
* rstrip
* isalnum
* isalpha
* isdecimal
* isspace
* isdigit
* find
* startswith
* endswith
* capitalize

This may seem like a lot but don't worry!  It's really only a few categories of things :)
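To make the "few categories" idea concrete, here is one possible grouping, sketched on a made-up sample string: methods that change case, methods that trim whitespace, and methods that answer a yes/no question about the string. The grouping is just one way to organize the list above.

```python
sample = "  Alice was beginning  "

# Category 1: changing case (each call returns a new string)
upper_case = sample.upper()
lower_case = sample.lower()
capitalized = "alice".capitalize()

# Category 2: trimming whitespace from the edges
both_sides = sample.strip()
left_only = sample.lstrip()
right_only = sample.rstrip()

# Category 3: yes/no questions that return True or False
checks = {
    "abc123 is alphanumeric": "abc123".isalnum(),
    "abc123 is alphabetic": "abc123".isalpha(),
    "123 is all digits": "123".isdigit(),
    "three spaces is whitespace": "   ".isspace(),
    "ALICE is upper case": "ALICE".isupper(),
}

print(repr(both_sides))
for question, answer in checks.items():
    print(question, "->", answer)
```

None of these methods modify the original string; `sample` is unchanged after every call.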

Let's get started with `split` and then move on to the `startswith` and `endswith` methods, because those are pretty easy to work with.

In [3]:
with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
print(type(lines))
lines[0]

<class 'list'>


"Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll"

The `split` method lets us specify a separator to split the text on.  Every time the separator appears, the text is cut at that point and the separator itself is dropped.  Here's a more worked example:

In [5]:
string = "For this string, we'll split, on, commas, okay?"
listing = string.split(",")
print(listing)
for elem in listing:
    print(elem)

['For this string', " we'll split", ' on', ' commas', ' okay?']
For this string
 we'll split
 on
 commas
 okay?


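So the string is split at every comma.  Notice that the spaces after the commas are kept, which is why most of the pieces start with a space.  If you don't pass in anything, `split` splits on runs of whitespace (spaces, tabs, and newlines) and leaves no empty strings in the result.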
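The difference between calling `split()` with no argument and calling it with an explicit separator is easy to miss, so here is a small sketch using made-up strings:

```python
messy = "one  two\tthree"  # two spaces, then a tab

# No argument: split on any run of whitespace
default_tokens = messy.split()
print(default_tokens)  # ['one', 'two', 'three']

# Explicit single space: every space is a cut point, and tabs are not touched
space_tokens = messy.split(" ")
print(space_tokens)  # ['one', '', 'two\tthree']

# The separator can be longer than one character
csv_like = "a, b, c"
comma_space_tokens = csv_like.split(", ")
print(comma_space_tokens)  # ['a', 'b', 'c']
```

The empty string in `space_tokens` comes from the two spaces in a row: there is nothing between them.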

Now that we can split on specific strings, let's look at the `startswith` and `endswith` methods.  These methods will be used to do some baseline analysis of the text!  

In [6]:
greeting_words = ["Hello", "Hi", "How are you?"]
goodbye_words = ["Goodbye", "See you", "See you!"]

with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
greetings = 0
goodbyes = 0
for line in lines:
    if any([line.startswith(elem) for elem in greeting_words]):
        greetings += 1
    if any([line.startswith(elem) for elem in goodbye_words]):
        goodbyes += 1
greetings, goodbyes

(0, 0)

In [7]:
greeting_words = ["Hello", "Hi", "How are you?"]
goodbye_words = ["Goodbye", "See you", "See you!"]

with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
greetings = 0
goodbyes = 0
for line in lines:
    if any([line.endswith(elem) for elem in greeting_words]):
        greetings += 1
    if any([line.endswith(elem) for elem in goodbye_words]):
        goodbyes += 1
greetings, goodbyes

(0, 0)

Looks like our set of greetings and goodbyes didn't yield any results.  I guess folks aren't very friendly!  Let's widen the search by ignoring case, but how do we do that?!  Enter `upper` and `lower`.

In [8]:
string = "Let's make this text all uppercase"
string.upper()

"LET'S MAKE THIS TEXT ALL UPPERCASE"

In [9]:
string = "Let's make ThIs tExT all lowercase"
string.lower()

"let's make this text all lowercase"

Now let's see if any of our greeting or goodbye words show up once we convert each line to lower case!

In [10]:
greeting_words = ["hello", "hi", "how are you?"]
goodbye_words = ["goodbye", "see you", "see you!"]

with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
greetings = 0
goodbyes = 0
for line in lines:
    line = line.lower()
    if any([line.startswith(elem) for elem in greeting_words]):
        greetings += 1
    if any([line.startswith(elem) for elem in goodbye_words]):
        goodbyes += 1
    if any([line.endswith(elem) for elem in greeting_words]):
        greetings += 1
    if any([line.endswith(elem) for elem in goodbye_words]):
        goodbyes += 1
greetings, goodbyes

(9, 1)

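Pay dirt!  We got some greetings and goodbyes!  It turns out that people are sometimes friendly in Alice in Wonderland after all!  Keep in mind that a short prefix like "hi" also matches lines that start with words such as "his", so some of these hits may not be real greetings.  Now let's see if any of our greeting words or goodbye words appear anywhere on any line in the text.  We'll use `find` to do this :)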
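Here is a small sketch of how prefix matching can over-count, along with one possible stricter check. The sample lines are made up (they are not taken from the book), and the stricter check only handles single-word greetings.

```python
# Made-up sample lines, not taken from the book
sample_lines = ["Hello, said Alice", "His voice was low", "Which way?", "", "hi there"]

# startswith accepts a tuple of prefixes, so no inner loop is needed
greeting_prefixes = ("hello", "hi")
prefix_matches = [line for line in sample_lines
                  if line.lower().startswith(greeting_prefixes)]
print(prefix_matches)  # includes "His voice was low", a false positive

# Stricter check: compare only the first word, with edge punctuation removed
greeting_set = {"hello", "hi"}
word_matches = []
for line in sample_lines:
    words = line.lower().split()
    # Guard against empty lines, where there is no first word
    if words and words[0].strip(".,!?;:\"'") in greeting_set:
        word_matches.append(line)
print(word_matches)  # ['Hello, said Alice', 'hi there']
```

The same tuple trick works for `endswith`.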

In [12]:
string = "Hello there friend"
print(string.find("there"))
print(string.find("whatever"))

6
-1


So `find` returns the index where the substring first occurs, or -1 if the substring is not in the string.
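Because `find` only reports the first occurrence, it can help to see how it relates to `in` and `count`, and how its optional start argument lets us walk through every occurrence. The helper name `find_all` below is made up for this sketch.

```python
def find_all(text, phrase):
    """Return the starting index of every occurrence of phrase in text."""
    positions = []
    start = text.find(phrase)
    while start != -1:
        positions.append(start)
        # Resume the search one character past the last hit
        start = text.find(phrase, start + 1)
    return positions

string = "Hello there friend, hello there stranger"

print("there" in string)           # True: use `in` when you only need yes/no
print(string.count("there"))       # 2: number of non-overlapping occurrences
print(find_all(string, "there"))   # [6, 26]
print(find_all(string, "whatever"))  # []
```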

In [14]:
greeting_words = ["hello", "hi", "how are you?"]
goodbye_words = ["goodbye", "see you", "see you!"]

with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
greetings = 0
goodbyes = 0
for line in lines:
    line = line.lower()
    greetings_found = [elem for elem in greeting_words if line.find(elem) != -1]
    goodbyes_found = [elem for elem in goodbye_words if line.find(elem) != -1]
    greetings += len(greetings_found)
    goodbyes += len(goodbyes_found)
greetings, goodbyes

(736, 4)

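Using the above method we are able to search the text in its entirety for occurrences of the six phrases of interest!  It is worth being clear about what is counted, though: each phrase is counted at most once per line, and `find` matches substrings, so "hi" is also found inside words like "this" and "which".  That is probably why the greetings count is so large.  Still, this leads us to a general point.  We can search text for words or phrases!!!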
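If we only want whole-word matches, one option is a regular expression that refuses to match when the phrase is glued to other word characters. This is a sketch of that idea, not the approach used above; the function name `count_phrase` and the sample sentence are made up.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive occurrences of phrase that stand on their own."""
    # (?<!\w) and (?!\w) require that no letter, digit, or underscore
    # touches the phrase on either side; re.escape protects the "?" and "!"
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "This is his hat. Hi there! Which hill? How are you?"

substring_count = sample.lower().count("hi")
whole_word_count = count_phrase(sample, "hi")
phrase_count = count_phrase(sample, "how are you?")

print(substring_count)   # 5: "this", "his", "hi", "which", "hill"
print(whole_word_count)  # 1: only the standalone "Hi"
print(phrase_count)      # 1
```

Unlike the loop above, this counts every occurrence rather than at most one per line.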

Now let's use our newfound powers to do something really cool - let's spell correct a bunch of text.

In [21]:
# `spell` comes from older autocorrect releases; newer releases may expose a Speller class instead
from autocorrect import spell

with open("alice_in_wonderland.txt","r") as f:
    text = f.read()
lines = text.split("\n")
new_lines = []
total_misspellings = 0
misspellings_per_line = []
for line in lines:
    # Split on whitespace so each token is (roughly) one word
    tokens = line.split()
    new_tokens = []
    misspellings = 0
    for token in tokens:
        correct_spelling = spell(token)
        # Count a misspelling whenever the corrector changed the token
        if correct_spelling != token:
            total_misspellings += 1
            misspellings += 1
        new_tokens.append(correct_spelling)
    misspellings_per_line.append(misspellings)
    new_lines.append(" ".join(new_tokens))
new_text = "\n".join(new_lines)
# Write the corrected text to a new file so the original is left untouched
with open("correctly_spelled_alice_in_wonderland.txt", "w") as f:
    f.write(new_text)

Just for fun, let's do a little bit of analysis on the number of misspellings in our text:

In [23]:
import statistics
print(statistics.mean(misspellings_per_line))
print(total_misspellings)

1.6383832976445396
6121
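One thing to keep in mind when reading these counts: `line.split()` leaves punctuation attached to the tokens, so the spell checker sees `Alice,` rather than `Alice`. Whether that inflates the numbers depends on how `spell` treats punctuation, which is not checked here. As a sketch, here are two ways to clean tokens first, using `translate` and `strip` on a made-up line:

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans("", "", string.punctuation)

line = '"Oh dear!" said Alice, "I shall be late."'
raw_tokens = line.split()

# translate removes punctuation everywhere in the token
translated_tokens = [token.translate(remove_punctuation) for token in raw_tokens]

# strip removes punctuation only from the two ends of the token
stripped_tokens = [token.strip(string.punctuation) for token in raw_tokens]

print(raw_tokens)
print(translated_tokens)

# The two approaches differ for apostrophes inside a word
print("Alice's".translate(remove_punctuation))  # Alices
print("Alice's".strip(string.punctuation))      # Alice's
```

For spell checking, `strip` is probably the gentler choice, since it keeps contractions and possessives intact.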


So we made a ton of corrections to this document!  Pretty good.  Okay, okay.  If we turn this into a function, we'll still need to point it at a bunch of files one at a time.  Can we do better?

Turns out we can!  Remember last week, when we learned how to move between directories?  Let's apply some of that knowledge now!

In [27]:
import os
from glob import glob
from autocorrect import spell

def spell_correct(file_path):
    with open(file_path, "r") as f:
        text = f.read()
        lines = text.split("\n")
    new_lines = []
    for line in lines:
        tokens = line.split()
        new_tokens = []
        for token in tokens:
            new_tokens.append(spell(token))
        new_line = " ".join(new_tokens)
        new_lines.append(new_line)
    return "\n".join(new_lines)

# WARNING: this overwrites every file it visits, so run it on a copy of your data
dirs_to_visit = [os.getcwd()]
traversed_dirs = set()
# Keep going until there are no directories left, instead of popping from an empty list
while dirs_to_visit:
    current_dir = dirs_to_visit.pop()
    if current_dir in traversed_dirs:
        continue
    traversed_dirs.add(current_dir)
    print("currently process", current_dir)
    # Build full paths from current_dir so we never need os.chdir
    paths = glob(os.path.join(current_dir, "*"))
    files = [path for path in paths if os.path.isfile(path)]
    # Skip symbolic links to directories so the walk cannot loop forever
    dirs_to_visit += [path for path in paths if os.path.isdir(path) and not os.path.islink(path)]
    print("beginning file processing")
    for file in files:
        correctly_spelled_file = spell_correct(file)
        with open(file,"w") as f:
            f.write(correctly_spelled_file)
    print("wrote out all files with spell correction")

currently process /Users/ericschles/Documents/projects/python_courses/an_introduction_to_python


KeyboardInterrupt: 
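The same kind of traversal can also be written with `os.walk` from the standard library, which visits a directory and everything beneath it for us. The sketch below is one possible shape, not a drop-in replacement: the function name `transform_tree` and its `extension` parameter are made up, and `str.upper` stands in for the spell corrector so the example runs quickly inside a temporary directory.

```python
import os
import tempfile

def transform_tree(root, transform, extension=".txt"):
    """Apply transform to the text of every matching file under root."""
    changed = []
    for dir_path, dir_names, file_names in os.walk(root):
        for file_name in file_names:
            # Only touch text files, so binary files are left alone
            if not file_name.endswith(extension):
                continue
            path = os.path.join(dir_path, file_name)
            with open(path, "r") as f:
                text = f.read()
            with open(path, "w") as f:
                f.write(transform(text))
            changed.append(path)
    return changed

# Build a small throwaway directory tree to try it on
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
for relative_path in ["a.txt", os.path.join("nested", "b.txt"), "skip.bin"]:
    with open(os.path.join(root, relative_path), "w") as f:
        f.write("hello")

changed = transform_tree(root, str.upper)
print(sorted(os.path.relpath(path, root) for path in changed))
```

Filtering by extension is a design choice worth keeping even with the real spell corrector, since rewriting every file in a project directory (notebooks, images, and so on) is rarely what we want.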