# Cleaning and Standardization
This notebook takes a text file of haikus saved in the form
```
sun-bleached billboard
the gravel road ends
at peaches

flute notes
fluttering
petals
```
and preprocesses them, then writes the preprocessed haikus to a CSV file.

To preprocess the haikus, I decided that they should retain apostrophes but no other punctuation. They should also retain numbers (for now). There are many hyphenated words and phrases, so hyphens and en dashes are replaced with spaces. The text also contains Unicode non-breaking spaces (`0xa0`), which are replaced with normal spaces. Runs of duplicate spaces are collapsed to a single space.

Further, duplicate haikus should be removed. To facilitate duplicate removal, the line breaks are first replaced with `/` characters so that each haiku can be compared as a single string; the final haikus are still stored as lists of lines.
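As a minimal sketch of why the `/` separator helps: a list of lines is not hashable, but a single joined string is, so it can serve as a dictionary key for order-preserving deduplication. The `join_lines` helper and the sample data below are hypothetical and are not part of `tools.utils`; the real `read_from_file` may join lines differently.

```python
def join_lines(lines):
    """Join a haiku's lines into one hashable string (hypothetical helper)."""
    return " / ".join(line.strip() for line in lines)


# Sample data for illustration only; the third haiku repeats the first.
raw = [
    ["flute notes", "fluttering", "petals"],
    ["sun-bleached billboard", "the gravel road ends", "at peaches"],
    ["flute notes", "fluttering", "petals"],
]

joined = [join_lines(haiku) for haiku in raw]

# dict keys are unique and keep insertion order, so this drops the repeat
# while keeping the first occurrence in place.
unique = list(dict.fromkeys(joined))

# Split back into lists of lines for storage.
as_lines = [[line.strip() for line in haiku.split("/")] for haiku in unique]
```

After this runs, `unique` holds two strings and `as_lines` holds the same two haikus as lists of three lines each.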

In [None]:
from multiprocessing.pool import Pool, ThreadPool

from tools.utils import read_from_file, preprocess, remove_stopwords, lemmatize

import pandas as pd

## Preprocessing

TODO: Consider leaving the preprocessing to spaCy?

Each haiku undergoes a sequence of preprocessing steps. First, Unicode non-breaking spaces (`0xa0`) are replaced with normal spaces. Second, there are many hyphenated words with both hyphens and en dashes, so both are replaced with spaces. Third, multiple non-ASCII characters are left over from downloading the haikus, so each of them is removed. Fourth, every run of multiple spaces is replaced with a single space. Last, to ensure nothing was missed by the previous rules, every haiku is run through a filter that removes anything that isn't a lowercase ASCII letter, a digit, a space, a single quote, or a forward slash (used to separate lines).
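The steps above can be sketched roughly as follows. This is not the actual `tools.utils.preprocess`, whose implementation is not shown in this notebook; `preprocess_sketch` is a hypothetical stand-in. It assumes the haiku arrives as one `/`-separated string and that the text is lowercased first (an assumption, since the final filter only keeps lowercase letters).

```python
import re


def preprocess_sketch(haiku: str) -> str:
    """Hypothetical stand-in for tools.utils.preprocess; the real one may differ."""
    # Assumption: lowercase up front so the final filter keeps every letter.
    text = haiku.lower()
    # 1. Unicode non-breaking spaces (0xa0) -> normal spaces.
    text = text.replace("\xa0", " ")
    # 2. Hyphens and en dashes (U+2013) -> spaces.
    text = re.sub(r"[-\u2013]", " ", text)
    # 3. Drop any remaining non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 4. Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # 5. Final filter: keep only a-z, digits, space, single quote, and "/".
    text = re.sub(r"[^a-z0-9 '/]", "", text)
    return text.strip()


sample = "Sun\u2013bleached\xa0billboard / the gravel-road ends, / at peaches\u2026"
cleaned = preprocess_sketch(sample)
```

For the sample above, `cleaned` is `"sun bleached billboard / the gravel road ends / at peaches"`. Because the sketch follows the order described here, with the filter running last, removing a character that sits between two spaces could leave a double space behind; collapsing spaces once more after the filter would be one way to guard against that.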

In [None]:
haikus = read_from_file()

# Remove duplicates in a manner that preserves order.
# Relies on dicts preserving insertion order: guaranteed from Python 3.7,
# a CPython implementation detail in 3.6.
haikus = list(dict.fromkeys(haikus))

In [None]:
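def process_haiku(haiku):
    """Build one output row: cleaned, stopword-free, and lemmatized lines."""
    haiku = preprocess(haiku)
    # Derive both variants from the cleaned string, then split each on "/"
    # (the line separator) into a list of stripped lines.
    lemmatized = [line.strip() for line in lemmatize(haiku).split("/")]
    nostops = [line.strip() for line in remove_stopwords(haiku).split("/")]
    haiku = [line.strip() for line in haiku.split("/")]

    return {
        "haiku": haiku,
        "nostops": nostops,
        "lemmas": lemmatized,
        "lines": len(haiku),  # number of lines in the cleaned haiku
    }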

In [None]:
# %%timeit -r 1 -n 1
# with Pool() as pool:
#     rows = pool.map(process_haiku, haikus)

In [None]:
# %%timeit -r 1 -n 1
# with ThreadPool() as pool:
#     rows = pool.map(process_haiku, haikus)

In [None]:
# %%timeit -r 1 -n 1
rows = list(map(process_haiku, haikus))

In [None]:
haikus = pd.DataFrame(rows)
haikus.tail()

In [None]:
haikus.to_csv('../haikus.csv')