## Working with Multiple Files

Most text-analysis problems involve working with multiple files, and you need a little Python to handle them. We'll add a few utility functions to our collection in nlp_utilities. The first gets the filenames from a directory of files, which is how your texts will usually be stored.

In [1]:
def get_filenames(folder):
    """ Pass in a folder name, with or without the / at end.
    Returns a list of the files & paths inside it (no folders).
    """
    from os import listdir
    from os.path import isfile, join
    # because we want to return full paths, we need to make sure there is
    # a / at the end.
    # If this doesn't work on Windows, change the slash direction.
    if folder[-1:] != "/":
        folder = folder + "/"
    # this will return only the filenames, not folders inside the path
    # also filter out .DS_Store which is on Macs.
    return [folder + f for f in listdir(folder) if isfile(join(folder, f)) and f != ".DS_Store"]

In [2]:
filenames = get_filenames("data/books/")
filenames

['data/books/Austen_Emma.txt',
 'data/books/Austen_Pride.txt',
 'data/books/edgar_allen_poe.txt',
 'data/books/lovecraft.txt',
 'data/books/Melville_MobyDick.txt',
 'data/books/mrjames_ghoststories.txt']
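
As a hedged alternative to the manual slash handling above, the same listing can be sketched with the standard library's `pathlib`, which builds paths with the right separator for the operating system. The name `get_filenames_pathlib` and the sorting step are assumptions added for illustration; they are not part of nlp_utilities.

```python
from pathlib import Path


def get_filenames_pathlib(folder):
    """Return sorted paths (as strings) of the files directly inside folder.

    Hypothetical variant of get_filenames: subfolders and the macOS
    .DS_Store file are skipped.
    """
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.name != ".DS_Store"
    )
```

Sorting is optional, but it makes the order of the list predictable, since the operating system may return directory entries in any order.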

To read in the text from each of these files, we will create a dictionary, with the key being the filename and the value being the text (not tokens yet).

In [3]:
def load_texts_as_string(filenames):
    """ Takes a list of filenames as arg.
    Returns a dictionary with filename as key, string as value.
    """
    from collections import defaultdict
    loaded_text = defaultdict(str)  # each value is a string, the text
    for filename in filenames:
        with open(filename, errors="ignore") as handle:
            loaded_text[filename] = handle.read()
    return loaded_text

Here is an example of running it on one file. The function requires a list as input, so we wrap a single filename in `[]`:

In [8]:
#load_texts_as_string([filenames[5]])

In [4]:
# To get the texts for every file in the folder, we pass in the filenames we just collected:

texts = load_texts_as_string(filenames)

In [5]:
type(texts)

collections.defaultdict
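
One consequence of the `defaultdict` type is worth a small illustration: looking up a key that was never loaded does not raise an error; it quietly stores an empty string under that key. The sketch below uses made-up keys and text rather than the real book files.

```python
from collections import defaultdict

# Stand-in for the dictionary returned by load_texts_as_string;
# the keys and text here are invented for illustration.
demo_texts = defaultdict(str)
demo_texts["data/books/example.txt"] = "Some text."

# A misspelled key returns the default (an empty string) instead of raising KeyError...
missing = demo_texts["data/books/exmaple.txt"]
print(repr(missing))

# ...and the lookup also adds the misspelled key to the dictionary.
print(len(demo_texts))

# Converting to a plain dict restores the usual KeyError for unknown keys.
plain_texts = dict(demo_texts)
print("data/books/nothing.txt" in plain_texts)
```

If a typo in a filename should fail loudly, converting the result with `dict()` is one simple option.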

In [6]:
# here is our list of files as keys:
texts.keys()

dict_keys(['data/books/Melville_MobyDick.txt', 'data/books/lovecraft.txt', 'data/books/mrjames_ghoststories.txt', 'data/books/Austen_Emma.txt', 'data/books/Austen_Pride.txt', 'data/books/edgar_allen_poe.txt'])

Note that to use this as a real list, we have to wrap it with the function `list()`.

In [7]:
list(texts.keys()) # [0]

['data/books/Melville_MobyDick.txt',
 'data/books/lovecraft.txt',
 'data/books/mrjames_ghoststories.txt',
 'data/books/Austen_Emma.txt',
 'data/books/Austen_Pride.txt',
 'data/books/edgar_allen_poe.txt']

### To get the values (the text strings), use

`texts.values()`
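
As a small sketch of how keys and values are typically used together, `.items()` yields both at once. The dictionary below is a made-up stand-in for `texts`, with invented filenames and short sample strings, so the snippet runs on its own.

```python
# Made-up stand-in for the texts dictionary (filename -> full text).
demo_texts = {
    "data/books/first.txt": "Call me Ishmael.",
    "data/books/second.txt": "It is a truth universally acknowledged.",
}

# Character count per file, keyed by filename.
lengths = {filename: len(text) for filename, text in demo_texts.items()}

for filename, length in lengths.items():
    print(filename, length)
```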

In [10]:
# we can also return the value using its key, of course:
texts['data/books/lovecraft.txt'][0:100]

'\n                        The\n                             Shunned House\n\n                        By '

## Here are some more utility functions and cleanup functions

Usually we don't want numbers, such as chapter numbers or dates, in our vocabulary. The function below removes any token that doesn't contain a letter, using Python's regular expression search. First, a quick look at how the search behaves:

In [14]:
import re
# the regular expression [a-zA-Z] means any letter in the alphabet,
# upper or lower case.  If we match against digits, there will be
# no hit.
print("Searching the string 12:", re.search('[a-zA-Z]', '12'))
print("Searching the string time12", re.search('[a-zA-Z]', 'time12'))
print("Searching the string 3rd", re.search('[a-zA-Z]', '3rd'))

Searching the string 12: None
Searching the string time12 <_sre.SRE_Match object; span=(0, 1), match='t'>
Searching the string 3rd <_sre.SRE_Match object; span=(1, 2), match='r'>


In [15]:
def remove_nonletters(wordlist):
    """
    Keeps only the tokens that contain at least one letter, using a
    regex search. Tokens that are just numbers or leftover punctuation
    (such as '--') are removed, so they no longer need a custom
    stopword entry. Tokens like 's contain a letter and will remain.
    """
    return [word for word in wordlist if re.search('[a-zA-Z]', word)]
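
A quick check of the filter on a hand-made token list (the tokens are invented for illustration, and the function is repeated so the snippet runs on its own):

```python
import re


def remove_nonletters(wordlist):
    """Keep only tokens that contain at least one ASCII letter."""
    return [word for word in wordlist if re.search('[a-zA-Z]', word)]


sample = ['chapter', '12', '--', "'s", '3rd', '1851', 'whale']
print(remove_nonletters(sample))
# ['chapter', "'s", '3rd', 'whale']
```

Note that `[a-zA-Z]` covers only unaccented letters, so a token made up entirely of accented characters would also be dropped.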

Now we can update our cleaning function with this too:

In [16]:
def clean_tokens(tokens):
    """ Takes a list of tokens. Returns lowercased, minus punct 
    and stopwords and digits.
    """
    words = remove_punct(tokens)
    words = remove_stops(words)
    words = remove_nonletters(words)
    return words
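
`remove_punct` and `remove_stops` are defined elsewhere in nlp_utilities and are not shown in this section. To make the order of the steps concrete, here is a self-contained sketch with simplified stand-ins for both. The stand-in bodies, the tiny stopword set, and the choice to lowercase during punctuation removal are all assumptions, not the actual implementations.

```python
import re
import string

# Assumption: single punctuation characters count as punctuation tokens.
DEMO_PUNCT = set(string.punctuation)

# Tiny illustrative stopword set; an assumption, not the list used by nlp_utilities.
DEMO_STOPWORDS = {"the", "of", "is", "it", "from"}


def remove_punct_demo(tokens):
    """Stand-in: lowercase tokens and drop single punctuation characters."""
    return [tok.lower() for tok in tokens if tok not in DEMO_PUNCT]


def remove_stops_demo(tokens):
    """Stand-in: drop tokens found in the demo stopword set."""
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]


def remove_nonletters(wordlist):
    """Keep only tokens that contain at least one ASCII letter."""
    return [word for word in wordlist if re.search('[a-zA-Z]', word)]


def clean_tokens_demo(tokens):
    """Apply the three cleaning steps in the same order as clean_tokens."""
    words = remove_punct_demo(tokens)
    words = remove_stops_demo(words)
    words = remove_nonletters(words)
    return words


demo_tokens = ["From", "the", "greatest", "of", "horrors", ",", "--", "1924", "."]
print(clean_tokens_demo(demo_tokens))
```

In this sketch the `--` and `1924` tokens survive the first two steps and are only removed by `remove_nonletters`, which is the gap the new step is meant to close.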

Here's a function to stem tokens:

In [17]:
def stem_tokens(tokens):
    """ Takes a list of tokens. Returns a list with the Porter stem
    of each token.
    """
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]

And here's a function that will tokenize, clean, and stem our text strings.

In [None]:
def tokenize_clean_stem(text):
    """ Takes an already-read text string. Returns a list of its
    tokens, cleaned and stemmed.
    """
    import nltk
    tokens = nltk.word_tokenize(text)
    tokens = clean_tokens(tokens)
    tokens = stem_tokens(tokens)
    return tokens

To use that, we need to import the file nlp_utilities.py that has all the other definitions (along with these new ones).

In [11]:
import nlp_utilities as mytools

### To see the functions available in this module, type "." after the module name in the Jupyter notebook and then press Tab. You will get a little popup menu!

In [12]:
mytools.

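SyntaxError: invalid syntax (<ipython-input-12-333308f82d15>, line 1)

(The error appears only because the incomplete line was run as a cell. Pressing Tab does not require running it.)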
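
Outside the notebook, or when the popup is not available, a similar list can be produced in code. This sketch uses the standard library's `inspect` module and, because nlp_utilities is not included here, runs against the built-in `string` module as a stand-in. The helper name `list_functions` is hypothetical.

```python
import inspect
import string  # stand-in for a module such as nlp_utilities


def list_functions(module):
    """Return the sorted names of the public functions available on a module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    )


print(list_functions(string))
```

With the import from above in place, the same call would be `list_functions(mytools)`.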

In [14]:
tokens = mytools.tokenize_clean_stem(texts['data/books/lovecraft.txt'])

In [15]:
tokens[0:20]

['shun',
 'hous',
 'h.',
 'p.',
 'lovecraft',
 'even',
 'greatest',
 'horror',
 'ironi',
 'seldom',
 'absent',
 'sometim',
 'enter',
 'directli',
 'composit',
 'event',
 'sometim',
 'relat',
 'fortuit',
 'posit']

For comparison, `tokenize_clean` (also in nlp_utilities) returns the same tokens without stemming:

In [16]:
tokens2 = mytools.tokenize_clean(texts['data/books/lovecraft.txt'])

In [17]:
tokens2[0:20]

['shunned',
 'house',
 'h.',
 'p.',
 'lovecraft',
 'even',
 'greatest',
 'horrors',
 'irony',
 'seldom',
 'absent',
 'sometimes',
 'enters',
 'directly',
 'composition',
 'events',
 'sometimes',
 'relates',
 'fortuitous',
 'position']

In [19]:
texts['data/books/lovecraft.txt'][0:200]

'\n                        The\n                             Shunned House\n\n                        By H. P. LOVECRAFT\n\n\n\nFrom even the greatest of horrors irony is seldom absent. Sometimes it\nenters dir'

## Style Guide

While we're talking about code: PEP 8 is Python's style guide. If you ever interview for a job that uses Python, you should be familiar with it. https://www.python.org/dev/peps/pep-0008/