# Preprocessing text for Deep Learning

This notebook performs basic preprocessing on the Quora datasets using the function(s) from `common.nlp.sequence_preprocessing`. The goal is to prepare the data for sequence models. This means that some information, such as punctuation and capitalization, is removed. If we want to use that information, it has to be captured in a separate process.

In [None]:
import sys
sys.path.append("../..")
import numpy as np
import pandas as pd

from common.nlp.sequence_preprocessing import WORD_MAP, preprocess_text_for_DL

In [None]:
# load data with right data types (this is important for the IDs in particular)
dtypes = {"qid": str, "question_text": str, "target": int}
train = pd.read_csv("../data/train.csv", dtype=dtypes)
test = pd.read_csv("../data/test.csv", dtype=dtypes)

### Preprocessing steps
The `preprocess_text_for_DL` function performs the following steps:
1. Convert text to lower case
2. Replace contractions (such as `won't`) with their full form (`will not`)
3. Remove punctuation

The `common.nlp.sequence_preprocessing` module contains a variable `WORD_MAP` that specifies the mapping to use in step 2. This mapping is used by default, but can be overridden in the function call.
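
The three steps can be sketched in plain Python as follows. This is only an illustration, not the actual implementation in `common.nlp.sequence_preprocessing`: the name `sketch_preprocess` and the contents of `EXAMPLE_WORD_MAP` are assumptions made for the example.

```python
import string

# Hypothetical mapping for illustration; the real WORD_MAP may contain other entries.
EXAMPLE_WORD_MAP = {"won't": "will not", "can't": "cannot", "i'm": "i am"}

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def sketch_preprocess(text, word_map=EXAMPLE_WORD_MAP):
    """Lower-case, expand contractions, then strip punctuation (illustrative sketch)."""
    # Step 1: lower case, so the keys of the word map only need one spelling.
    text = text.lower()
    # Step 2: expand contractions word by word. This must happen before
    # punctuation is removed, otherwise "won't" has already become "wont".
    words = [word_map.get(word, word) for word in text.split()]
    text = " ".join(words)
    # Step 3: delete all remaining punctuation characters.
    return text.translate(_PUNCT_TABLE)


print(sketch_preprocess("Why won't it work?"))
```

Note that this whitespace-based lookup misses a contraction that is directly followed by punctuation (for example `won't?`); a real implementation would have to handle that case.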

In [None]:
WORD_MAP

Step 3 removes every character that appears in `string.punctuation`.

In [None]:
import string
string.punctuation

Preprocessing can be applied to one dataset or to several datasets at once (all positional arguments are assumed to be datasets).

In [None]:
# one dataset example
just_train = preprocess_text_for_DL(train, text_col="question_text", word_map=WORD_MAP)
just_train.head()

In [None]:
# multiple dataset example
cleaned_train, cleaned_test = preprocess_text_for_DL(train, test)
cleaned_train.shape, cleaned_test.shape

### Results
Let's look at some (random) results.

In [None]:
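def sample_result(idx=None):
    """Print the row position, the original question and its preprocessed version."""
    if idx is None:
        # no position given: pick a random row from the training set
        idx = np.random.randint(0, len(cleaned_train))
    print(idx)
    print(train["question_text"].iloc[idx])
    print(cleaned_train["question_text"].iloc[idx])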

In [None]:
# run this cell to see a random preprocessed instance and its original
sample_result()

#### Some examples we might need to deal with differently...

In [None]:
sample_result(1287932)

Should hyphens be replaced with a space instead of removed?
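
One way to try this out is to turn hyphens into spaces before deleting the rest of the punctuation. The sketch below is a standalone illustration; the function name is hypothetical and it is not part of `preprocess_text_for_DL`.

```python
import string

# Punctuation to delete outright: everything in string.punctuation except the hyphen.
_DELETE_TABLE = str.maketrans("", "", string.punctuation.replace("-", ""))


def remove_punctuation_keep_word_breaks(text):
    """Replace hyphens with spaces and delete all other punctuation."""
    # "well-known" becomes "well known" instead of "wellknown".
    text = text.replace("-", " ")
    text = text.translate(_DELETE_TABLE)
    # Collapse the double spaces that a free-standing hyphen leaves behind.
    return " ".join(text.split())


print(remove_punctuation_keep_word_breaks("a well-known state-of-the-art model!"))
```

Splitting on hyphens keeps the parts as separate tokens, which are more likely to exist in a pretrained vocabulary than the glued-together form.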

In [None]:
sample_result(378162)

With slashes, maybe we should just choose one of the options (e.g., the most common one in the corpus?) to maintain a logical sentence?
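
A minimal sketch of that idea: count word frequencies over the corpus and, for every token that contains a slash, keep the option that occurs most often. The function name and the toy corpus are made up for illustration, and the step would have to run before punctuation is removed.

```python
from collections import Counter


def resolve_slashes(text, word_counts):
    """For tokens like "he/she", keep the option that is most frequent in the corpus."""
    resolved = []
    for token in text.split():
        if "/" in token:
            options = [option for option in token.split("/") if option]
            if options:
                # A Counter returns 0 for unseen words; ties go to the first option.
                token = max(options, key=lambda option: word_counts[option])
        resolved.append(token)
    return " ".join(resolved)


# Toy corpus for illustration only.
corpus = ["he said hello", "he went home", "she said goodbye"]
word_counts = Counter(word for line in corpus for word in line.split())

print(resolve_slashes("what should he/she do", word_counts))
```

This treats every slash as a choice between alternatives, so fractions and dates such as `1/2` would need to be excluded first.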

In [None]:
sample_result(1012331)

Maybe we should replace $ with `dollar` and replace general abbreviations like y.o. (years old) with full words?

Also, we probably have to replace digits.
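
These replacements could look roughly like the sketch below. The abbreviation table, the function name and the choice to map every number to the placeholder word `number` are all assumptions for the example; what digits should actually become is still an open question. The function expects lower-cased text and has to run before punctuation removal, since it relies on the `$` and the dots in `y.o.` still being present.

```python
import re

# Hypothetical abbreviation table; extend as needed.
ABBREVIATIONS = {"y.o.": "years old"}

# A run of digits, optionally with thousands or decimal separators (3,000 or 2.5).
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def expand_symbols(text, abbreviations=ABBREVIATIONS, number_token="number"):
    """Spell out abbreviations and "$", and replace numbers with a placeholder word."""
    for short, full in abbreviations.items():
        text = text.replace(short, full)
    text = text.replace("$", " dollar ")
    text = _NUMBER.sub(f" {number_token} ", text)
    # Collapse the extra whitespace introduced by the padded replacements.
    return " ".join(text.split())


print(expand_symbols("i am 25 y.o. and earn $3,000"))
```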

In [None]:
sample_result(1068603)

What about `'s` that indicates possession? By removing the apostrophe, the word turns into its plural form or becomes a misspelled name, as in this example.
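
One option is to drop the possessive suffix entirely before punctuation is removed, so that the base word stays intact. The sketch below is an illustration with a hypothetical function name. It assumes contractions such as `it's` have already been expanded by the word map, because otherwise they would be stripped as well.

```python
import re

# A word character followed by 's (straight or curly apostrophe) at a word boundary.
_POSSESSIVE = re.compile(r"(\w)['’]s\b")


def strip_possessive(text):
    """Remove a trailing possessive 's so that "newton's" becomes "newton"."""
    return _POSSESSIVE.sub(r"\1", text)


print(strip_possessive("what is newton's third law"))
```

Whether to drop the suffix or keep it as a separate token depends on how the downstream vocabulary treats it.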

In [None]:
sample_result(535806)

Some weird characters are still there apparently.
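
`string.punctuation` only contains ASCII punctuation, so characters such as curly quotes or other Unicode symbols survive step 3. A small helper like the one below (the name is hypothetical) can list which unexpected characters are left and how often they occur.

```python
from collections import Counter
import string

# Characters we expect after preprocessing: lower-case ASCII letters, digits, spaces.
_EXPECTED = set(string.ascii_lowercase + string.digits + " ")


def unexpected_characters(texts):
    """Count every character in the texts that is not a letter, digit or space."""
    counts = Counter()
    for text in texts:
        counts.update(char for char in text if char not in _EXPECTED)
    return counts


print(unexpected_characters(["what is π", "“quoted” text"]).most_common())
```

Calling it on `cleaned_train["question_text"]` and looking at `.most_common(20)` would show which characters are worth adding to the removal or replacement rules.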

__For the most part, it seems to work fine though.__