# Should You Clean Your Data?

### A recurring question is whether or not one should preprocess the text. In fact, most popular kernels appear to do very little preprocessing and still reach good scores (around 0.68).

**In this kernel, I use a simple deep learning model and compare its performance on texts preprocessed with the usual methods against its performance on raw texts.**

#### The model is the following:
* GloVe Embedding
* Bidirectional GRU
* Attention 
* Dense 

I did not bother tuning it because it is good enough to highlight the point I'm trying to make.


#### Feel free to give any feedback, it is always appreciated. (Please upvote!)

In [None]:
import numpy as np
import pandas as pd
import keras
import seaborn as sns
import matplotlib.pyplot as plt
import codecs
import unidecode
import re
import spacy
from nltk.corpus import stopwords
from time import time

from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.models import Model
from keras.layers import Dense, Embedding, Dropout, Bidirectional, CuDNNGRU, GlobalMaxPool1D, Input


import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

## Loading data

In [None]:
df = pd.read_csv("../input/train.csv")
print("Number of texts: ", df.shape[0])

In [None]:
for i in range(5):
    print(df['question_text'][df.index[i]])

**So far, the texts look quite complicated, so I'm going to apply some usual text processing techniques to simplify them:**
* Expanding contractions (n't, 've and so on)
* Removing numbers and special characters
* Lowercasing
* Removing stopwords
* Lemmatization (keeping only the base form of each word)

## Treating Texts

In [None]:
from nltk import WordNetLemmatizer

wnl = WordNetLemmatizer()

contraction_mapping = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because",
    "could've": "could have", "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have",
    "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'd've": "he would have",
    "he'll": "he will", "he'll've": "he will have", "he's": "he is", "how'd": "how did",
    "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would",
    "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
    "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am",
    "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
    "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
    "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
    "this's": "this is", "that'd": "that would", "that'd've": "that would have", "that's": "that is",
    "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will",
    "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
    "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will",
    "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not",
    "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is",
    "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did",
    "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have",
    "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
    "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
    "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
    "you're": "you are", "you've": "you have",
}

stop_words = set(stopwords.words('english'))
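
A lookup keyed on exact tokens only fires when a contraction appears in precisely the listed form, so a capitalized "Don't" or a "can't," with trailing punctuation would pass through unchanged. Below is a minimal sketch of a more tolerant lookup. It is not part of the original kernel: `small_mapping` is a tiny illustrative subset and `make_expander` is a hypothetical helper name.

```python
import re

# Tiny illustrative subset of a contraction mapping (assumption, not the full dict).
small_mapping = {"don't": "do not", "can't": "cannot", "it's": "it is", "'cause": "because"}

def make_expander(mapping):
    """Build a function that expands contractions regardless of case or nearby punctuation."""
    lowered = {key.lower(): value for key, value in mapping.items()}
    # Longest keys first, so "can't've" would win over "can't" if both were present.
    alternatives = "|".join(re.escape(k) for k in sorted(lowered, key=len, reverse=True))
    # Lookarounds instead of \b, because some keys start with an apostrophe.
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", flags=re.IGNORECASE)

    def expand(text):
        return pattern.sub(lambda m: lowered[m.group(0).lower()], text)

    return expand

expand = make_expander(small_mapping)
print(expand("Don't worry, it's fine; I can't, sorry."))
```

The replacement text comes out lowercase, which is harmless here because the cleaning pipeline lowercases everything afterwards anyway.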

In [None]:
def treat_text(text):
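    # Decoding escape sequences, then transliterating to ASCII
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        # Fall back to the raw text when it cannot be decoded
        decoded = unidecode.unidecode(text)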
        
    # Handling Apostrophes
    apostrophe_handled = re.sub("’", "'", decoded)
    text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in apostrophe_handled.split(" ")])
    
    # Keeping letters + lowercasing
    text = re.findall(r"[a-zA-Z]+", text.lower())
    
    # Removing stopwords
    text = [word for word in text if (word not in stop_words and len(word)>2)]
    
    # Lemmatizing
    text = [wnl.lemmatize(word) for word in text]
    
    # Removing repetitions
    text = re.sub(r'(.)\1+', r'\1\1', ' '.join(text))
    
    return text
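
The last step of `treat_text` is easy to misread: the regex collapses any run of a repeated character down to two, so legitimate double letters survive while elongated words are shortened. Here is a small standalone sketch of that one step (`squeeze_repeats` is a name I made up for illustration):

```python
import re

def squeeze_repeats(text):
    """Collapse every run of the same character to exactly two occurrences."""
    return re.sub(r"(.)\1+", r"\1\1", text)

for sample in ["sooooo coooool", "good book", "aaa  bbb"]:
    print(repr(sample), "->", repr(squeeze_repeats(sample)))
```

Note that the pattern matches any character, not only letters, so a run of spaces or punctuation would be squeezed too if it were still present at that point.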

In [None]:
t0 = time()
print('Cleaning data ... ')
df['treated_text'] = df['question_text'].transform(treat_text)
print(f"Data cleaned in {round(time() - t0, 1)} seconds")

In [None]:
for i in range(5):
    print(df['treated_text'][df.index[i]])

Harder to read, but much simpler!

## Text lengths

In [None]:
df['length'] = df['question_text'].transform(lambda x: len(x.split(' ')) // 5 * 5)
df['treated_length'] = df['treated_text'].transform(lambda x: len(x.split(' ')) // 5 * 5)
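
The `// 5 * 5` expression buckets each word count by rounding it down to the nearest multiple of 5. A quick standalone check of that arithmetic (the helper name `length_bucket` and the sample strings are only for illustration):

```python
def length_bucket(text, width=5):
    """Word count (split on single spaces) rounded down to a multiple of `width`."""
    return len(text.split(' ')) // width * width

samples = [
    "Why",
    "How do I learn Python quickly",
    "one two three four five six seven eight nine ten eleven twelve",
]
for sample in samples:
    print(length_bucket(sample), "<-", sample)
```

One quirk worth knowing: `''.split(' ')` returns `['']`, so a text that ends up empty after cleaning still counts as one word and lands in bucket 0.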

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x=df['length'])
plt.title('Length distribution (rounded down to a multiple of 5)')
plt.yscale('log')
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x=df['treated_length'])
plt.title('Treated length distribution (rounded down to a multiple of 5)')
plt.yscale('log')
plt.show()

Treating the texts makes them shorter, which allows a smaller padded sequence length and therefore a model with fewer parameters and a shorter training time.

In [None]:
max_len = 70
max_len_treated = 40
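
A cap such as 70 or 40 can be sanity-checked by measuring how many texts fit under it without truncation. The sketch below assumes a plain list of token counts. `truncation_coverage` is a hypothetical helper and the lengths are made-up toy values, not taken from the dataset.

```python
def truncation_coverage(lengths, max_len):
    """Fraction of texts whose token count fits within `max_len`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

toy_lengths = [3, 8, 12, 15, 22, 41, 75]  # toy values for illustration only
for cap in (40, 70):
    print(cap, round(truncation_coverage(toy_lengths, cap), 3))
```

On the real data one could pass something like `df['question_text'].str.split().str.len().tolist()` and pick the smallest cap that keeps the coverage close to 1.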

## Tokenizer

In [None]:
def make_tokenizer(texts, len_voc):
    from keras.preprocessing.text import Tokenizer
    t = Tokenizer(num_words=len_voc)
    t.fit_on_texts(texts)
    return t

In [None]:
len_voc = 50000

In [None]:
tokenizer = make_tokenizer(df['question_text'], len_voc)
tokenizer_treated = make_tokenizer(df['treated_text'], len_voc)
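
To make the role of `len_voc` concrete, here is a simplified pure-Python stand-in for what a frequency-ranked tokenizer does. It is only a sketch under my understanding of the Keras `Tokenizer` (indices start at 1, most frequent word first, indices at or above `num_words` are dropped), and it skips the punctuation filtering that Keras also performs. The function names are mine.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort keeps first-seen order on ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, silently dropping unknown words and words ranked >= num_words."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("the dog sat down", word_index, num_words=3))
print(to_sequence("the dog sat down", word_index, num_words=6))
```

The point to notice is that rare and unseen words simply vanish from the sequence, so a smaller vocabulary also shortens the encoded texts.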

## Train/Test split

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.1)

## Data for the Network

In [None]:
X_train = tokenizer.texts_to_sequences(df_train['question_text'])
X_test = tokenizer.texts_to_sequences(df_test['question_text'])

In [None]:
X_train_treated = tokenizer_treated.texts_to_sequences(df_train['treated_text'])
X_test_treated = tokenizer_treated.texts_to_sequences(df_test['treated_text'])

In [None]:
from keras.preprocessing.sequence import pad_sequences

X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

X_train_treated = pad_sequences(X_train_treated, maxlen=max_len_treated)
X_test_treated = pad_sequences(X_test_treated, maxlen=max_len_treated)
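
With its default arguments, `pad_sequences` pads and truncates at the start of the sequence (`padding='pre'`, `truncating='pre'`), so questions longer than the cap lose their first words. A pure-Python sketch of that behaviour, with `pad_pre` as an illustrative name:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items when too long."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

print(pad_pre([5, 3, 9], 5))              # shorter than maxlen: zeros are added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # longer than maxlen: the first items are dropped
```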

In [None]:
y_train = df_train['target'].values
y_test = df_test['target'].values

## Loading pre-trained word vectors

In [None]:
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

def load_embedding(file):
    if file == '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o)>100)
    else:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file, encoding='latin'))
    return embeddings_index
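
Each line of these embedding files is a word followed by its space-separated vector components, which is why `get_coefs(*o.split(" "))` works: the first field becomes `word` and the rest become the vector. The `len(o) > 100` filter presumably skips the short header line at the top of the `.vec` file. A dependency-free sketch of the parsing step (`parse_vector_line` is a hypothetical helper, and the sample line is invented):

```python
def parse_vector_line(line):
    """Split 'word v1 v2 ... vn' into (word, [v1, ..., vn]) with float components."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]

sample = "hello 0.1 -0.25 3e-2\n"  # invented line in the same text format
word, vector = parse_vector_line(sample)
print(word, vector)
```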

In [None]:
def make_embedding_matrix(embedding, tokenizer, len_voc):
    all_embs = np.stack(list(embedding.values()))
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]
    word_index = tokenizer.word_index
    embedding_matrix = np.random.normal(emb_mean, emb_std, (len_voc, embed_size))
    
    for word, i in word_index.items():
        if i >= len_voc:
            continue
        embedding_vector = embedding.get(word)
        if embedding_vector is not None: 
            embedding_matrix[i] = embedding_vector
    
    return embedding_matrix
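
Words that are not found keep their random initialization. Since the Keras `Tokenizer` lowercases by default while, to my knowledge, the 840B GloVe vocabulary is case-sensitive, some misses may only be a matter of casing. Below is a hedged sketch of a lookup that tries a few casings before giving up. It is not what this kernel does, and both `lookup_with_fallback` and the toy embedding are assumptions for illustration.

```python
def lookup_with_fallback(word, embedding):
    """Return the first vector found among several casings of `word`, else None."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        vector = embedding.get(candidate)
        if vector is not None:
            return vector
    return None

toy_embedding = {"Paris": [0.1, 0.2], "NASA": [0.3, 0.4], "cat": [0.5, 0.6]}
for query in ["paris", "nasa", "cat", "dog"]:
    print(query, "->", lookup_with_fallback(query, toy_embedding))
```

Swapping `embedding.get(word)` for such a helper inside `make_embedding_matrix` would be one way to test whether casing matters for the final score.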

In [None]:
glove = load_embedding('../input/embeddings/glove.840B.300d/glove.840B.300d.txt')

In [None]:
embed_mat = make_embedding_matrix(glove, tokenizer, len_voc)
embed_mat_treated = make_embedding_matrix(glove, tokenizer_treated, len_voc)

## Attention Layer
> Code from Khoi Nguyen, check here: https://www.kaggle.com/suicaokhoailang/lstm-attention-baseline-0-652-lb

In [None]:
class Attention(Layer):
    def __init__(self, step_dim, W_regularizer=None, b_regularizer=None, W_constraint=None, b_constraint=None, bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)
        
    def build(self, input_shape):
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1],), initializer=self.init, name='{}_W'.format(self.name), regularizer=self.W_regularizer, constraint=self.W_constraint)
        self.features_dim = input_shape[-1]
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],), initializer='zeros', name='{}_b'.format(self.name), regularizer=self.b_regularizer, constraint=self.b_constraint)
        else:
            self.b = None
        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
        if self.bias: eij += self.b
        eij = K.tanh(eij)
        a = K.exp(eij)
        if mask is not None: a *= K.cast(mask, K.floatx())
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim
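
Stripped of the Keras plumbing, the layer scores each time step with `tanh(x_t . W + b_t)`, turns the scores into weights by exponentiating and normalizing, and returns the weighted sum of the step vectors. The sketch below replays that computation for a single example with plain Python lists. The function name and the toy numbers are mine, and masking is left out.

```python
import math

def attention_pool(x, w, b, eps=1e-7):
    """x: T step vectors of length F; w: length F; b: length T.
    Returns (pooled vector of length F, attention weights of length T)."""
    scores = [math.tanh(sum(xi * wi for xi, wi in zip(step, w)) + bt)
              for step, bt in zip(x, b)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps) + eps  # same role as K.epsilon() in the layer
    weights = [e / total for e in exps]
    pooled = [sum(a * step[f] for a, step in zip(weights, x)) for f in range(len(w))]
    return pooled, weights

steps = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]  # T=3 steps, F=2 features (toy values)
pooled, weights = attention_pool(steps, w=[0.2, -0.1], b=[0.0, 0.0, 0.0])
print("weights:", [round(a, 3) for a in weights])
print("pooled:", [round(v, 3) for v in pooled])
```

Because the bias has one entry per time step, the layer is tied to a fixed sequence length, which is why `step_dim` must equal `max_len` when the model is built.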

## Making model

In [None]:
def make_model(embedding_matrix, max_len, len_voc=50000, embed_size=300):
    inp = Input(shape=(max_len,))
    x = Embedding(len_voc, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    x = Attention(max_len)(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
model = make_model(embed_mat, max_len)
model_treated = make_model(embed_mat_treated, max_len_treated)

In [None]:
model.summary()

### Fitting

In [None]:
model.fit(X_train, y_train, batch_size=1024, epochs=3, validation_data=(X_test, y_test))

In [None]:
model_treated.fit(X_train_treated, y_train, batch_size=1024, epochs=3, validation_data=(X_test_treated, y_test))

### Predictions

In [None]:
pred_train = model.predict([X_train], batch_size=256, verbose=1)
pred_test = model.predict([X_test], batch_size=256, verbose=1)

In [None]:
pred_train_treated = model_treated.predict([X_train_treated], batch_size=256, verbose=1)
pred_test_treated = model_treated.predict([X_test_treated], batch_size=256, verbose=1)

### Tweaking threshold

In [None]:
def tweak_threshold(pred, truth):
    from sklearn.metrics import f1_score
    scores = []
    for thresh in np.arange(0.1, 0.501, 0.01):
        thresh = np.round(thresh, 2)
        score = f1_score(truth, (pred>thresh).astype(int))
        scores.append(score)
    return round(np.max(scores), 4)
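
`tweak_threshold` only reports the best F1, and it picks the threshold on the same data it scores, which makes the number somewhat optimistic. A dependency-free sketch that also returns the winning threshold, so that it could be chosen on one split and reused on another, is given below. The helper names and the toy predictions are assumptions for illustration.

```python
def f1(truth, predicted):
    """Binary F1 score computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, truth, start=0.10, stop=0.50, step=0.01):
    """Scan thresholds in [start, stop] and return (threshold, F1) of the first best one."""
    n_steps = int(round((stop - start) / step))
    best_t, best_score = None, -1.0
    for k in range(n_steps + 1):
        t = round(start + k * step, 2)
        score = f1(truth, [int(p > t) for p in probs])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, round(best_score, 4)

probs = [0.05, 0.15, 0.30, 0.45, 0.60, 0.20]  # toy predicted probabilities
truth = [0, 0, 1, 1, 1, 0]
print(best_threshold(probs, truth))
```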

In [None]:
print(f"Scored {tweak_threshold(pred_train, y_train)} without text treatment on train data")
print(f"Scored {tweak_threshold(pred_test, y_test)} without text treatment on test data")

In [None]:
print(f"Scored {tweak_threshold(pred_train_treated, y_train)} with text treatment on train data")
print(f"Scored {tweak_threshold(pred_test_treated, y_test)} with text treatment on test data")

**The model without treatment appears to significantly outperform the other one.**

I believe this is because GloVe word vectors capture the information better when the words are not processed.

In fact, word vectors can deal with numbers, most special characters, words starting with an uppercase letter, and so on.

Moreover, Keras' Tokenizer already performs the most basic preprocessing steps (lowercasing and punctuation removal).
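
One way to probe this explanation is to measure how much of each vocabulary actually has a pretrained vector. The sketch below assumes whitespace tokenization and a toy vocabulary. `embedding_coverage` is a hypothetical helper, and the sample texts are invented, so the printed numbers say nothing about the real dataset.

```python
from collections import Counter

def embedding_coverage(texts, vocabulary):
    """Return (share of distinct words covered, share of all tokens covered)."""
    counts = Counter(word for text in texts for word in text.split())
    if not counts:
        return 0.0, 0.0
    known = {w: c for w, c in counts.items() if w in vocabulary}
    return len(known) / len(counts), sum(known.values()) / sum(counts.values())

toy_vocabulary = {"running", "runs", "run", "2018", "Trump"}  # stand-in for the GloVe keys
raw = ["Trump runs in 2018", "running"]
treated = ["trump run", "running"]
print("raw:", embedding_coverage(raw, toy_vocabulary))
print("treated:", embedding_coverage(treated, toy_vocabulary))
```

Running the same function on `df['question_text']` and `df['treated_text']` against the keys of `glove` would show whether the cleaning steps really push words out of the embedding vocabulary.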
   
## Conclusion: Text cleaning is a waste of time
### Well, at least the steps I chose are...
One should focus on optimizing the model instead, and perhaps on doing some data augmentation.
 