# Demo #0, data preprocessing

For this demonstration, we will work with a small body of text: Shakespeare's sonnets. The dataset is very small, likely too small to train a usable language model. However, it presents a few interesting challenges and gives us an opportunity to do some preprocessing and learn how to use several NLP tools.

The first step is to create a folder, then download the sonnets.  In this example, we will download them all as one big text document, then split them into the individual sonnets.

In [2]:
import wget
import os

# Create the working folder if it does not already exist, then move into it.
os.makedirs('Sonnets', exist_ok=True)
os.chdir('Sonnets')
wget.download('https://shakespeare.folger.edu/downloads/txt/shakespeares-sonnets_TXT_FolgerShakespeare.txt')

'shakespeares-sonnets_TXT_FolgerShakespeare.txt'
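Re-running the download cell fetches the file again every time. The following is a minimal sketch of a guard that skips the download when the file is already on disk, using only the standard library instead of `wget`. The names `download_if_missing` and `fetch` are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
import os
from urllib.request import urlretrieve

SONNETS_URL = ('https://shakespeare.folger.edu/downloads/txt/'
               'shakespeares-sonnets_TXT_FolgerShakespeare.txt')


def download_if_missing(url, filename=None, fetch=urlretrieve):
    """Download url to filename unless that file already exists.

    fetch is a hypothetical hook (defaulting to urlretrieve) that makes
    the function easy to exercise without network access.
    """
    if filename is None:
        # Default to the last path component of the URL.
        filename = url.rsplit('/', 1)[-1]
    if not os.path.exists(filename):
        fetch(url, filename)
    return filename


# Usage (requires network access):
# download_if_missing(SONNETS_URL)
```

Keeping the raw download untouched on disk fits the goal of preserving the original observations for later retraining.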

There are many ways to break this file down into individual sonnets; here is one. The exact method is not critical, as long as all of the data is preserved. We don't want to do too much processing right now, because we want the "ground truth observations" to remain available in case the model needs to be retrained.

Generally, the workflow is easier to understand if all of the processing happens in one place, just before the data is used to train a language model.

In [31]:
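active_sonnet = None
with open('shakespeares-sonnets_TXT_FolgerShakespeare.txt', 'r') as raw_text:
    for line in raw_text:
        stripped = line.strip()
        if stripped.isdigit():
            # A line holding only a number marks the start of a new sonnet.
            if active_sonnet is not None:
                active_sonnet.close()
            sonnet_name = "Sonnet_" + stripped.zfill(3) + '.txt'
            print("Now Processing: " + sonnet_name)
            active_sonnet = open(sonnet_name, 'w')
        elif line != '\n' and active_sonnet is not None:
            # Skip blank lines and any header text before the first sonnet.
            active_sonnet.write(line)
# Close the last sonnet so its contents are flushed to disk.
if active_sonnet is not None:
    active_sonnet.close()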

Now Processing: Sonnet_001.txt
Now Processing: Sonnet_002.txt
Now Processing: Sonnet_003.txt
Now Processing: Sonnet_004.txt
Now Processing: Sonnet_005.txt
Now Processing: Sonnet_006.txt
Now Processing: Sonnet_007.txt
Now Processing: Sonnet_008.txt
Now Processing: Sonnet_009.txt
Now Processing: Sonnet_010.txt
Now Processing: Sonnet_011.txt
Now Processing: Sonnet_012.txt
Now Processing: Sonnet_013.txt
Now Processing: Sonnet_014.txt
Now Processing: Sonnet_015.txt
Now Processing: Sonnet_016.txt
Now Processing: Sonnet_017.txt
Now Processing: Sonnet_018.txt
Now Processing: Sonnet_019.txt
Now Processing: Sonnet_020.txt
Now Processing: Sonnet_021.txt
Now Processing: Sonnet_022.txt
Now Processing: Sonnet_023.txt
Now Processing: Sonnet_024.txt
Now Processing: Sonnet_025.txt
Now Processing: Sonnet_026.txt
Now Processing: Sonnet_027.txt
Now Processing: Sonnet_028.txt
Now Processing: Sonnet_029.txt
Now Processing: Sonnet_030.txt
Now Processing: Sonnet_031.txt
Now Processing: Sonnet_032.txt
Now Proc
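Because the split files are left as raw text, any cleaning can live in one place at load time. The following is a minimal sketch, assuming the `Sonnet_NNN.txt` files produced above are in the given folder. The function name `load_sonnets` and the tokenization regex are illustrative choices, not the notebook's own method.

```python
import glob
import os
import re


def load_sonnets(folder='.'):
    """Return {file name: list of lowercase tokens} for every Sonnet_*.txt."""
    sonnets = {}
    for path in sorted(glob.glob(os.path.join(folder, 'Sonnet_*.txt'))):
        with open(path, 'r') as sonnet_file:
            text = sonnet_file.read()
        # All preprocessing happens here; the files on disk stay untouched.
        # Assumption: words are runs of ASCII letters and straight apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        sonnets[os.path.basename(path)] = tokens
    return sonnets
```

If the cleaning rules change later, only this function needs to change, and the model can be retrained from the same raw files.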

# Stop Words

The predefined stop words we can use in gensim are sourced from https://gist.github.com/sebleier/554280.
That default list is inappropriate for our data because it includes gendered pronouns. Much of Shakespeare's
work addresses his two lovers, the golden boy and the dark lady, so gender is very important to our corpus
of text.

We must therefore define our own list of stop words. I'm going to demonstrate how to create and edit a config
file using Python, but you can do this just as easily using a plain-text editor like Sublime Text or Notepad.

In [15]:
stop_words = ['a', 'about', 'actually', 'almost', 'also', 'although', 
              'always', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'became', 'become', 'but', 'by', 'can', 'could', 
              'did', 'do', 'does', 'each', 'either', 'else', 'for', 
              'from', 'had', 'has', 'have', 'hence', 'how', 'i', 'if', 
              'in', 'is', 'it', 'its', 'just', 'may', 'maybe', 'me', 
              'might', 'mine', 'must', 'my', 
              'neither', 'nor', 'not', 'of', 'oh', 'ok', 'when', 'where', 
              'whereas', 'wherever', 'whenever', 'whether', 'which', 'while', 
              'who', 'whom', 'whoever', 'whose', 'why', 'will', 'with', 
              'within', 'without', 'would', 'yes', 'yet', 'you', 'your']
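A hand-maintained list like this is easy to get wrong, so a quick check can guard against repeated entries or a gendered pronoun slipping in. The following is a sketch only: `find_problems` and the `PROTECTED_WORDS` set are hypothetical names and an assumed example set, not part of the original notebook.

```python
# Assumed example set of words that must stay in the corpus.
PROTECTED_WORDS = {'he', 'him', 'his', 'she', 'her', 'hers'}


def find_problems(stop_words, protected=PROTECTED_WORDS):
    """Return (duplicated words, protected words present) for a stop-word list."""
    seen = set()
    duplicates = []
    for word in stop_words:
        if word in seen and word not in duplicates:
            duplicates.append(word)
        seen.add(word)
    leaked = sorted(seen & set(protected))
    return duplicates, leaked


# Example with a short sample list
duplicates, leaked = find_problems(['a', 'my', 'mine', 'my', 'her'])
print(duplicates)  # ['my']
print(leaked)      # ['her']
```

Calling `find_problems(stop_words)` on the full list should return two empty lists if the list is clean.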

In [26]:
with open('stopwords.txt', 'w') as stop_word_file:
    # Write one stop word per line so the file is easy to edit by hand.
    for word in stop_words:
        stop_word_file.write(word + '\n')
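Once the config file exists, it can be read back and applied to tokenized text. The following is a minimal sketch, assuming one stop word per line as written above. The helper names `load_stop_words` and `remove_stop_words` are hypothetical and use plain Python rather than any gensim API.

```python
def load_stop_words(path='stopwords.txt'):
    """Read one stop word per line, ignoring blank lines."""
    with open(path, 'r') as stop_word_file:
        return {line.strip() for line in stop_word_file if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop tokens found in stop_words, comparing in lowercase."""
    return [token for token in tokens if token.lower() not in stop_words]


# Usage, assuming stopwords.txt is in the current directory:
# custom_stop_words = load_stop_words()
# remove_stop_words(['Shall', 'I', 'compare', 'thee'], custom_stop_words)
```

Loading into a set keeps membership checks fast and silently absorbs any repeated entries in the file.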