# Preprocessing text for Deep Learning

This notebook performs basic preprocessing on the Quora datasets using the function(s) from `common.nlp.sequence_preprocessing`. The goal is to prepare the data for sequence models. This means that some information, such as punctuation and capitalization, is removed. If we want to use that information, it has to be captured in a separate process.

In [53]:
import sys
sys.path.append("../..")
import numpy as np
import pandas as pd

from common.nlp.sequence_preprocessing import WORD_MAP, preprocess_text_for_DL

In [54]:
# load data with right data types (this is important for the IDs in particular)
dtypes = {"qid": str, "question_text": str, "target": int}
train = pd.read_csv("../data/train.csv", dtype=dtypes)
test = pd.read_csv("../data/test.csv", dtype=dtypes)

### Preprocessing steps
The `preprocess_text_for_DL` function performs the following steps:
1. Convert text to lower case
2. Replace contractions (such as `won't`) with their full form (`will not`)
3. Selectively remove, retain or ignore punctuation
4. Replace numbers with # (e.g., 1 with #, 23 with ##, 1993 with ####)

The `common.nlp.sequence_preprocessing` module contains a variable `WORD_MAP` that specifies the mapping to use in step 2. This mapping is used by default, but can be overridden in the function call.
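
As a rough illustration of steps 1, 2 and 4, the sketch below lowercases a string, expands contractions from a small mapping, and masks digits. It is not the implementation in `common.nlp.sequence_preprocessing`: the function name `expand_and_mask` and the two-entry `demo_map` are hypothetical, and the regular expressions are just one possible way to do this.

```python
import re

# Hypothetical two-entry mapping standing in for WORD_MAP.
demo_map = {"won't": "will not", "it's": "it is"}

def expand_and_mask(text, word_map):
    """Lowercase, expand contractions, and replace every digit with '#'."""
    text = text.lower()
    if word_map:
        # One alternation pattern; longer keys first so they win over shorter ones.
        keys = sorted(word_map, key=len, reverse=True)
        pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
        text = pattern.sub(lambda m: word_map[m.group(0)], text)
    # Each digit becomes one '#', so 1993 turns into ####.
    return re.sub(r"\d", "#", text)

print(expand_and_mask("It's 1993 and I won't go", demo_map))
# it is #### and i will not go
```

Lowercasing first means the mapping only needs lower-case keys, which matches the keys shown in `WORD_MAP`.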

In [55]:
WORD_MAP

{"aren't": 'are not',
 "didn't": 'did not',
 "doesn't": 'does not',
 "don't": 'do not',
 "hadn't": 'had not',
 "hasn't": 'has not',
 "he's": 'he is',
 "how's": 'how is',
 "i'm": 'i am',
 "i've": 'i have',
 "isn't": 'is not',
 "it's": 'it is',
 "she's": 'she is',
 "that's": 'that is',
 "they're": 'they are',
 "they've": 'they have',
 "we're": 'we are',
 "we've": 'we have',
 "what's": 'what is',
 "won't": 'will not',
 "you've": 'you have'}

The punctuation characters that are removed are those in `string.punctuation`.

In [56]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
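
The outputs further down suggest that characters in `puncts_ignore` are replaced by a space, characters in `puncts_retain` are kept, and all other punctuation is deleted. That reading is an assumption based on the examples, not something the module states here. Under that assumption, a minimal sketch with `str.translate` could look like this (`handle_punctuation` is a hypothetical name):

```python
import string

def handle_punctuation(text, puncts_ignore="", puncts_retain=""):
    """Delete punctuation, except: 'ignore' chars become spaces, 'retain' chars stay."""
    table = {}
    for ch in string.punctuation:
        if ch in puncts_retain:
            continue                  # leave the character untouched
        if ch in puncts_ignore:
            table[ord(ch)] = " "      # assumed meaning of "ignore": split the words
        else:
            table[ord(ch)] = None     # delete the character
    # Collapse runs of whitespace created by the replacements.
    return " ".join(text.translate(table).split())

print(handle_punctuation("auto-generated R&D, isn't it?", puncts_ignore="/-", puncts_retain="&"))
# auto generated R&D isnt it
```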

Preprocessing can be applied to a single dataset or to several datasets at once (all positional arguments are assumed to be datasets).

In [57]:
# one dataset example
just_train = preprocess_text_for_DL(train, text_col="question_text", word_map=WORD_MAP, puncts_ignore='/-', puncts_retain='&')
just_train.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | how did quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | do you have an adopted dog how would you encou... | 0 |
| 2 | 0000412ca6e4628ce2cf | why does velocity affect time does velocity af... | 0 |
| 3 | 000042bf85aa498cd78e | how did otto von guericke used the magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | can i convert montra helicon d to a mountain b... | 0 |


In [58]:
just_train['question_text'][7]

'is it crazy if i wash or wipe my groceries off germs are everywhere'

In [59]:
# multiple dataset example
cleaned_train, cleaned_test = preprocess_text_for_DL(train, test, puncts_ignore='/-', puncts_retain='&')
cleaned_train.shape, cleaned_test.shape

((1306122, 3), (56370, 2))
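
Accepting any number of datasets as positional arguments is commonly done with `*args`. The sketch below shows that pattern with a hypothetical `clean_frames` helper that only lowercases the text column. It illustrates the calling convention and is not the module's code. Returning a single frame for one input and a tuple for several mirrors the two calls above.

```python
import pandas as pd

def clean_frames(*datasets, text_col="question_text"):
    """Apply the same cleaning to every DataFrame passed positionally."""
    cleaned = []
    for df in datasets:
        out = df.copy()                          # leave the caller's frame unchanged
        out[text_col] = out[text_col].str.lower()
        cleaned.append(out)
    # One input gives one frame; several inputs give a tuple in the same order.
    return cleaned[0] if len(cleaned) == 1 else tuple(cleaned)

a = pd.DataFrame({"question_text": ["Hello World"]})
b = pd.DataFrame({"question_text": ["Second FRAME"]})
only_a = clean_frames(a)
new_a, new_b = clean_frames(a, b)
```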

### Results
Let's look at some (random) results.

In [60]:
def sample_result(idx=None):
    """Print the original and the preprocessed question at position idx (random if None)."""
    if idx is None:
        idx = np.random.randint(0, len(cleaned_train))
    print(idx)
    print(train["question_text"].iloc[idx])
    print(cleaned_train["question_text"].iloc[idx])

In [61]:
# run this cell to see a random preprocessed instance and its original
sample_result()

324883
Why are there mezzanine levels in buildings?
why are there mezzanine levels in buildings


#### Some examples we might need to deal with differently...

In [62]:
sample_result(1287932)

1287932
Can Cloud Formation scripts be auto-generated from an AWS production runtime?
can cloud formation scripts be auto generated from an aws production runtime


Should hyphens be replaced with a space instead of removed? <b>(Handled!)</b>

In [63]:
sample_result(378162)

378162
What types of accommodation/ modifications to one's lifestyle would be necessary to live on venus?
what types of accommodation modifications to ones lifestyle would be necessary to live on venus


With slashes, maybe we should just choose one of the options (e.g., the most common one in the corpus?) to maintain a logical sentence? <b>(Handled!)</b>

In [64]:
sample_result(1012331)

1012331
What is your advice for a 27 y.o who got only 1$ in his pocket?
what is your advice for a ## yo who got only 1 in his pocket


Maybe we should replace `$` with `dollar` and expand common abbreviations like y.o. (years old) into full words?

Also, we probably have to replace digits.
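
The currency and abbreviation idea could be tried as a step that runs before punctuation is stripped. The sketch below is only one possible approach: the tables `SYMBOL_WORDS` and `ABBREVIATIONS` and the function `expand_symbols_and_abbreviations` are hypothetical, and their contents would need to be chosen after inspecting the corpus.

```python
import re

# Hypothetical replacement tables; extend them after inspecting the corpus.
SYMBOL_WORDS = {"$": " dollar ", "€": " euro ", "%": " percent "}
ABBREVIATIONS = {"y.o.": "years old", "y.o": "years old"}

def expand_symbols_and_abbreviations(text):
    """Spell out selected symbols and abbreviations before punctuation is stripped."""
    text = text.lower()
    # Longest abbreviation first so 'y.o.' is tried before 'y.o'.
    for abbr in sorted(ABBREVIATIONS, key=len, reverse=True):
        # Lookarounds prevent matches inside longer tokens.
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, ABBREVIATIONS[abbr], text)
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Collapse the extra spaces introduced around the inserted words.
    return " ".join(text.split())

print(expand_symbols_and_abbreviations("a 27 y.o who got only 1$ in his pocket?"))
# a 27 years old who got only 1 dollar in his pocket?
```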

In [65]:
sample_result(1068603)

1068603
Was Sheldon Cooper's personality inspired by Odo from Star Trek?
was sheldon coopers personality inspired by odo from star trek


What about the `'s` that indicates possession? Removing the apostrophe turns the word into a plural form or, as here, into a misspelled name.
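
One option is to handle the possessive `'s` before general punctuation removal, for example by dropping it so that the base word is kept. The sketch below does that with a regular expression. It assumes contractions such as `it's` have already been expanded, because the pattern cannot tell the two apart. `strip_possessive` is a hypothetical helper, not part of the module.

```python
import re

# Matches 's (straight or curly apostrophe) at the end of a word.
POSSESSIVE = re.compile(r"(?<=\w)['’]s\b")

def strip_possessive(text):
    """Remove possessive 's so that "cooper's" becomes "cooper", not "coopers"."""
    return POSSESSIVE.sub("", text)

print(strip_possessive("was sheldon cooper's personality inspired by odo"))
# was sheldon cooper personality inspired by odo
```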

In [66]:
sample_result(535806)

535806
Does the drug “LSD” have a negative impact on cognitive functions?
does the drug lsd have a negative impact on cognitive functions


Some weird characters are still there apparently.
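
Typographic quotes such as `“` and `”` are not part of `string.punctuation`, which only contains ASCII characters, so a filter based on that constant alone would leave them in place. One way to catch them, sketched here with the standard `unicodedata` module, is to drop every character whose Unicode category starts with `P` (punctuation). The helper name `strip_unicode_punctuation` is hypothetical, and this is an alternative approach rather than a description of what the module does.

```python
import unicodedata

def strip_unicode_punctuation(text, keep=""):
    """Drop all Unicode punctuation (categories P*), except characters in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("P")
    )

print(strip_unicode_punctuation("does the drug “lsd” have a negative impact?"))
# does the drug lsd have a negative impact
```

Note that symbols such as `$`, `+` and `<` belong to the Unicode `S` categories, so they would need separate handling under this approach.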

__For the most part, it seems to work fine, though.__