# Cleaning and Standardization
This notebook takes a text file of haikus saved in the form
```
sun-bleached billboard
the gravel road ends
at peaches

flute notes
fluttering
petals
```
and preprocesses them before creating a CSV file containing the preprocessed haikus.

To preprocess the haikus, I decided that they should retain apostrophes, but no other punctuation. They should also retain numbers (for now). All text is lowercased. There are many hyphenated words and phrases, so hyphens and en dashes are replaced with spaces rather than deleted, which keeps the component words separate. There are also Unicode non-breaking spaces (`0xa0`), which are replaced with normal spaces. Runs of whitespace are collapsed to a single space.
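
As a rough illustration of these rules on a single line, here is a minimal sketch. It is not the notebook's `preprocess` function: the names `TO_SPACE`, `KEEP`, and `clean_line` are hypothetical, and it uses `str.translate` for the replacements, which is one possible approach among several.

```python
import re
import string

# Map non-breaking space, hyphen, and en dash to a plain space in one pass.
TO_SPACE = str.maketrans({'\xa0': ' ', '-': ' ', '\u2013': ' '})
# Characters that survive cleaning: lowercase letters, digits, space, apostrophe.
KEEP = set(string.ascii_lowercase + string.digits + " '")


def clean_line(line):
    """Apply the cleaning rules described above to one line (illustrative only)."""
    line = line.translate(TO_SPACE).lower()
    line = ''.join(ch for ch in line if ch in KEEP)
    # Collapse runs of spaces left behind by the replacements and removals.
    return re.sub(r' +', ' ', line).strip()


print(clean_line('sun-bleached billboard'))       # sun bleached billboard
print(clean_line("Oh, it's a lonely\xa0night!"))  # oh it's a lonely night
```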

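Further, duplicate haikus should be removed. To facilitate duplicate removal, each haiku is first joined into a single string with its line breaks replaced by `/`, so that it can be stored in a set. Afterwards, the haikus are split back into lists of lines for storage.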
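
The reason for the `/`-joined form is that Python lists are not hashable and so cannot be members of a set, while strings can. A minimal sketch of the idea, using made-up input and a simplified stand-in (`normalize`, a hypothetical helper) for the real preprocessing:

```python
# Two copies of the same haiku that differ only in case and punctuation.
raw = [
    ['flute notes', 'fluttering', 'petals'],
    ['Flute notes,', 'fluttering', 'petals.'],
]


def normalize(line):
    # Simplified stand-in for the notebook's preprocessing:
    # lowercase the line and keep only letters and spaces.
    return ''.join(ch for ch in line.lower() if ch.isalpha() or ch == ' ').strip()


# Join lines with '/' so each haiku becomes one hashable string.
unique = {'/'.join(normalize(line) for line in haiku) for haiku in raw}

# Split back into lists of lines once the duplicates are gone.
as_lists = [joined.split('/') for joined in unique]
print(as_lists)  # [['flute notes', 'fluttering', 'petals']]
```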

In [1]:
import re
import string

import pandas as pd

# Preserve lowercase letters, digits, spaces, apostrophes, and the `/` line separator.
ALPHABET = frozenset(string.ascii_lowercase + " " + "'" + "/" + string.digits)

In [2]:
def preprocess(text):
    """Preprocess the text of a haiku.
    
    Remove all punctuation except apostrophes and the `/` line separator.
    Lowercase the text and restrict it to the ASCII character set.
    """
    # Replace 0xa0 (unicode nbsp) with an ascii space
    text = text.replace('\xa0', ' ')
    # Replace hyphen and en dash with space
    text = text.replace('-', ' ')
    text = text.replace('–', ' ')
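    # Remove all other non-ascii characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Normalize every run of whitespace to a single space
    text = re.sub(r'\s+', ' ', text)
    # Lowercase and drop every character outside ALPHABET
    text = ''.join(filter(ALPHABET.__contains__, text.lower()))

    # Collapse again: removed punctuation can leave adjacent spaces behind
    return re.sub(r' +', ' ', text).strip()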

In [3]:
preprocess('fadf\xa0asd/f&!^#\'""ASDFIU')

"fadf asd/f'asdfiu"

In [4]:
# Use a set to remove duplicates.
haikus = set()
with open('haikus.txt', 'r', encoding='utf-8') as datafile:
    haiku = ''
    for line in datafile:
        line = line.strip()
        if line:
            # Separate lines with `/`, but not if it's the first line in the haiku
            if haiku:
                haiku += '/'
            haiku += line
        # A blank line and a nonempty haiku string signal the end of a haiku.
        elif haiku:
            haikus.add(preprocess(haiku))
            haiku = ''
    # Flush the final haiku if the file does not end with a blank line.
    if haiku:
        haikus.add(preprocess(haiku))
rows = []
for haiku in haikus:
    # Store each haiku as a list of lines, to facilitate easier analysis.
    haiku = haiku.split('/')
    row = {
        'haiku': haiku,
        'lines': len(haiku),
    }
    rows.append(row)

In [5]:
haikus = pd.DataFrame(rows)
max_row = haikus['lines'].idxmax()
print(haikus.loc[max_row])
haikus.tail()

haiku    [sucking, chocolate squares, oh it's a lonely ...
lines                                                    5
Name: 866, dtype: object


Unnamed: 0,haiku,lines
23194,"[thunder, the farmers at the, credit union app...",3
23195,"[first date, letting her put snow, down my neck]",3
23196,"[migrating birds, i hear his voice in the wind...",3
23197,"[peace officers' memorial, the empty slots, fo...",3
23198,"[striking a match, dawn flashed in, the oval m...",3


In [None]:
haikus.to_csv('haikus.csv')
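
One thing to keep in mind when loading this file later: the `haiku` column holds Python lists, and `to_csv` writes each one as its string form (for example `"['flute notes', 'fluttering', 'petals']"`). The sketch below shows one way to get lists back when reading, using `ast.literal_eval` as a converter. The two-row `frame` is a made-up stand-in for the real data, and an in-memory buffer replaces `haikus.csv` so the example is self-contained.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame shaped like the notebook's `haikus` DataFrame.
frame = pd.DataFrame({
    'haiku': [
        ['flute notes', 'fluttering', 'petals'],
        ['first date', 'letting her put snow', 'down my neck'],
    ],
    'lines': [3, 3],
})

# Write to an in-memory buffer instead of 'haikus.csv'.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Without the converter, the 'haiku' column would be read back as plain strings.
restored = pd.read_csv(buffer, index_col=0,
                       converters={'haiku': ast.literal_eval})
print(restored.loc[0, 'haiku'])  # ['flute notes', 'fluttering', 'petals']
```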