# Text processing using Keras
In this kernel we will explore different models that we can use for text processing.

## Content
1. [Data Preparation](#1) 
2. [Basic Model](#2)
3. [First Keras Model](#3)
4. [Word Embeddings](#4)
5. [Keras Embedding Layer](#5)
6. [Using Pretrained Word Embeddings](#6)
   1. [GloVe](#6-1)
   2. [Wiki](#6-2)
   3. [Word2Vec](#6-3)
7. [CONVNET](#7)
8. [CuDNNLSTM](#8)
9. [CuDNNGRU](#9)
10. [Create your model](#10)
11. [References](#11)

To be continued ...

<a id="1"></a>
# Data preparation

## Import the necessary libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.models import Sequential
from keras import layers

In [None]:
!ls ../input

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")

Let's have a look at the training data:

In [None]:
train_df.info()

The question column contains each sample text.

In [None]:
train_df.loc[1:3]["question_text"]

First we fill the null entries in the training and test sets with a placeholder string.

In [None]:
X_train = train_df["question_text"].fillna("kh").values
X_test = test_df["question_text"].fillna("kh").values
y = train_df["target"]

To plot the performance of the models we need the following helper function.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

def plot_history(history):
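    # Older Keras versions log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'
    acc = history.history.get('acc', history.history.get('accuracy'))
    val_acc = history.history.get('val_acc', history.history.get('val_accuracy'))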
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

<a id="2"></a>
# Basic model: Logistic Regression 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(list(X_train))
X_train_vec = vectorizer.transform(list(X_train))
X_test_vec  = vectorizer.transform(list(X_test))

#feature selection
from sklearn.feature_selection import SelectKBest, chi2
max_features = 50000
ch2 = SelectKBest(chi2, k=max_features)  # keep the 50,000 highest-scoring features
x_train = ch2.fit_transform(X_train_vec, y)
x_test = ch2.transform(X_test_vec)

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y)
pre = classifier.predict_proba(x_test)

# For submission 
#y_pre= [ np.argmax(i) for i in pre]
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_pre})
#submit_df.to_csv("submission.csv", index=False)

<a id="3"></a>
# First Keras Model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_vec= vectorizer.fit_transform(list(X_train))
X_test_vec = vectorizer.transform(list(X_test))

feature_names = vectorizer.get_feature_names()
len(feature_names)

In [None]:
#feature selection
max_features = 50000
ch2 = SelectKBest(chi2, k=max_features)  # keep the 50,000 highest-scoring features
x_train = ch2.fit_transform(X_train_vec, y)
x_test = ch2.transform(X_test_vec)

In [None]:
from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Dense(10, input_dim=max_features, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

model.summary()

In [None]:
from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

#vars
batch_size = 32
epochs = 4

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)

plot_history(hist)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases

#for submission
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

In [None]:
# Performance 
loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))

<a id="4"></a>
# Word Embeddings

"The word embeddings do not understand the text as a human would, but they rather map the statistical structure of the language used in the corpus. Their aim is to map semantic meaning into a geometric space. This geometric space is then called the embedding space. vector arithmetic should become possible. A famous example in this field of study is the ability to map King - Man + Woman = Queen."

There are two ways to obtain word embeddings:
1. Train the word embeddings yourself, during the training of your neural network.
2. Use pretrained word embeddings directly in your model. In that case you can either leave the embeddings unchanged during training or train them further.
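
The vector arithmetic mentioned in the quote can be illustrated with a toy example. This is only a sketch: the two-dimensional vectors in `toy_vectors` are hand-made for illustration (they are not real embeddings), and the helpers `cosine` and `nearest` are hypothetical names, not part of Keras.

```python
import numpy as np

# Hand-made 2-d toy vectors (NOT real embeddings):
# axis 0 loosely stands for "royalty", axis 1 for "maleness".
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.05, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, vectors, exclude=()):
    """Return the word whose vector is most similar to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# King - Man + Woman, then look for the closest remaining word
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
result = nearest(target, toy_vectors, exclude=("king", "man", "woman"))
print(result)
```

With real pretrained embeddings the same idea applies, but the result depends on the corpus and is not guaranteed to be this clean.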

In [None]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(list(X_train))

x_train = tokenizer.texts_to_sequences(list(X_train))
x_test = tokenizer.texts_to_sequences(list(X_test))
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

In [None]:
print(list(X_train)[2])
x_train[2]

In [None]:
for word in [ 'why', 'does', 'velocity', 'affect', 'time']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))
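
For intuition, the mapping built by `Tokenizer` can be approximated in plain Python. The sketch below is a simplified stand-in, not the Keras implementation: it only lowercases and splits on whitespace, and the names `build_word_index` and `to_sequences` are hypothetical. It reproduces two behaviours that matter here: words are ranked by frequency starting at index 1 (0 is reserved), and `num_words` only limits which indices appear in the sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; index 0 stays reserved for padding
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, keeping only indices below num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

Note that the word index still contains every word seen during fitting, so `len(tokenizer.word_index) + 1` counts the full vocabulary rather than `num_words`.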

In [None]:
# make all tokenized sentences in same size
from keras.preprocessing.sequence import pad_sequences
maxlen = 100
x_train = pad_sequences(x_train, padding='post', maxlen=maxlen)
x_test = pad_sequences(x_test, padding='post', maxlen=maxlen)

In [None]:
print(list(X_train)[2])
print(x_train[2])

In [None]:
print(x_train.shape)
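
What `pad_sequences` does with `padding='post'` can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Keras code: `pad_post` is a hypothetical helper, and it mirrors the default `truncating='pre'`, which drops tokens from the front of sequences that are longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate overly long sequences at the front."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

rows = pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(rows)
```

Every row now has the same length, which is what lets the sequences be stacked into a single array of shape `(number of samples, maxlen)`.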

<a id="5"></a>
# Keras Embedding Layer
At this stage the data is tokenized, but each word is still just an arbitrary integer index. We will learn a new embedding space using the Keras Embedding layer, which takes the previously calculated integers and maps each of them to a dense vector.
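
Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below illustrates that idea only; the sizes are toy values rather than the notebook's, and `embedding_table` is a randomly initialised stand-in for the weights that Keras would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, maxlen = 6, 3, 4  # toy sizes, not the notebook's values

# One row per word index; in a real model these weights are learned
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two padded sequences of token ids, shape (batch, maxlen)
batch = np.array([[2, 5, 0, 0],
                  [1, 1, 3, 4]])

# Integer indexing replaces every id by its row: shape (batch, maxlen, embedding_dim)
embedded = embedding_table[batch]
print(embedded.shape)
```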

One way of using the Keras Embedding layer is to take its output and feed it into a Dense layer. To do this you have to add a Flatten layer in between, which prepares the sequential input for the Dense layer:

In [None]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

batch_size = 32
epochs = 4

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)
plot_history(hist)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#for submission 
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

#performance
loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))

The other way is to use a pooling layer (here GlobalMaxPool1D) instead of Flatten.
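
The difference between the two options is easiest to see in terms of shapes. The following NumPy sketch uses a small made-up tensor in place of the embedding output; it only mimics what `Flatten` and `GlobalMaxPool1D` do to the shape and is not Keras code.

```python
import numpy as np

# Stand-in for an embedding output: (batch=2, maxlen=4, embedding_dim=3)
embedded = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Flatten: concatenate all time steps, shape (batch, maxlen * embedding_dim)
flattened = embedded.reshape(embedded.shape[0], -1)

# Global max pooling: maximum over the time axis, shape (batch, embedding_dim)
pooled = embedded.max(axis=1)

print(flattened.shape, pooled.shape)
```

With pooling, the size of the input to the following Dense layer no longer depends on `maxlen`, so that layer needs far fewer weights.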

In [None]:
from keras.models import Sequential
from keras import layers

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

batch_size = 32
epochs = 5

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)
plot_history(hist)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))

<a id="6"></a>
# Using Pretrained Word Embeddings

Instead of training the embedding space ourselves, we can use a precomputed embedding space that was trained on a much larger corpus. There are different approaches for generating such a space:

1. Word2Vec, developed by Google, is one of the most popular methods; it trains the embedding space with neural networks.
2. GloVe (Global Vectors for Word Representation), developed by the Stanford NLP Group, builds the space from a co-occurrence matrix using matrix factorization.

Both apply dimensionality reduction techniques; Word2Vec is generally described as more accurate and GloVe as faster to compute.
For this experiment, the ../input/embeddings/ directory contains several embedding spaces trained on different corpora:

In [None]:
! ls ../input/embeddings/*/

## Load the embedding matrix

The following function helps us generate the embedding matrix that will be loaded into our model. Each line in the file starts with a word, followed by the embedding vector for that word. We don’t need all the words; we keep only those that appear in our vocabulary.

In [None]:
def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding='utf-8') as f:  # be explicit about the file encoding
        for line in f:
            word, *vector = line.split(' ')
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix
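
The file format described above can be tried out on a tiny in-memory example. This is a sketch with made-up vectors and a made-up `word_index`; `parse_embedding_lines` is a hypothetical variant of the function above that reads from any iterable of lines instead of a file path.

```python
import io
import numpy as np

def parse_embedding_lines(lines, word_index, embedding_dim):
    """Fill one row per vocabulary word from 'word v1 v2 ...' lines."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim))  # row 0 is reserved
    for line in lines:
        word, *vector = line.rstrip().split(" ")
        if word in word_index:
            # Keep only the first `embedding_dim` values of the stored vector
            matrix[word_index[word]] = np.asarray(vector, dtype=np.float32)[:embedding_dim]
    return matrix

# Made-up three-dimensional "pretrained" vectors
fake_file = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.4 0.5 0.6\n"
    "fish 0.7 0.8 0.9\n"
)
word_index = {"cat": 1, "bird": 2, "dog": 3}  # "bird" is missing from the file

matrix = parse_embedding_lines(fake_file, word_index, embedding_dim=2)
print(matrix)
```

Words that are absent from the file, such as "bird" here, keep an all-zero row, which is what the coverage check in the next section counts.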

<a id="6-1"></a>
## GloVe

In [None]:
embedding_dim = 50  # only the first 50 of the file's 300 dimensions are kept
embedding_matrix_glove = create_embedding_matrix('../input/embeddings/glove.840B.300d/glove.840B.300d.txt',
        tokenizer.word_index, embedding_dim)

#embedding_matrix.shape

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix_glove, axis=1))
nonzero_elements / vocab_size

The ratio is a fraction, so a value of 0.55 means that about 55% of the vocabulary is covered by the pretrained model. Let's check the coverage of the other pretrained models.
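
How the coverage ratio is computed can be checked on a small hand-made matrix. The values below are invented for illustration; the two `np.count_nonzero` calls are the same ones used in the cell above.

```python
import numpy as np

# Toy embedding matrix: rows 0 and 2 have no pretrained vector
toy_matrix = np.array([
    [0.0, 0.0],   # index 0, reserved for padding
    [0.1, 0.2],
    [0.0, 0.0],   # word not found in the pretrained file
    [0.4, 0.5],
])

# Inner call: non-zero entries per row; outer call: rows with at least one
rows_with_vector = np.count_nonzero(np.count_nonzero(toy_matrix, axis=1))
coverage = rows_with_vector / toy_matrix.shape[0]
print(coverage)
```

Because the reserved row 0 is always zero and is included in the denominator, the ratio slightly underestimates the true coverage.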

<a id="6-2"></a>
## Wiki

In [None]:
embedding_matrix_wiki = create_embedding_matrix('../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec',
        tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix_wiki, axis=1))
nonzero_elements / vocab_size


As you can see, the GloVe pretrained model has better vocabulary coverage than the wiki model.

<a id="6-3"></a>
## Word2Vec

For this corpus we modify the create_embedding_matrix function to be compatible with the binary Google News format:

In [None]:
from gensim.models import KeyedVectors
def create_embedding_matrix_google(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding again 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    wv_from_bin = KeyedVectors.load_word2vec_format(filepath, binary=True) 
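    # Look up each vocabulary word directly; this avoids relying on the order of the
    # loaded vocabulary and works with both older and newer gensim KeyedVectors APIs
    for word, idx in word_index.items():
        if word in wv_from_bin:
            embedding_matrix[idx] = np.array(wv_from_bin[word], dtype=np.float32)[:embedding_dim]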

    return embedding_matrix

In [None]:
filepath = "../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin"

embedding_matrix_google = create_embedding_matrix_google(filepath,tokenizer.word_index, embedding_dim)
#embedding_matrix_google.shape
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix_google, axis=1))
nonzero_elements / vocab_size

So we decide to use the first pretrained model, GloVe. Now we can use the word embeddings in our models. In the first model we keep the pretrained embeddings fixed by setting trainable=False:

In [None]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix_glove], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

batch_size = 32
epochs = 5

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)
plot_history(hist)

In the previous model the word embeddings were not trained further. Now we check how the model performs if we allow the embeddings to be trained, by using trainable=True:

In [None]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix_glove], 
                           input_length=maxlen, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

batch_size = 32
epochs = 5

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)
plot_history(hist)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))
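
The practical difference between the two runs is which weights the optimizer is allowed to update. The sketch below counts parameters for the Embedding, GlobalMaxPool1D, Dense(10), Dense(1) stack used above. It is plain arithmetic rather than Keras code: `count_parameters` is a hypothetical helper and the vocabulary size of 20,000 is an assumed value, not the notebook's actual `vocab_size`.

```python
def count_parameters(vocab_size, embedding_dim, hidden_units=10, trainable_embedding=False):
    """Parameter counts for Embedding -> GlobalMaxPool1D -> Dense(hidden) -> Dense(1)."""
    embedding = vocab_size * embedding_dim                  # one vector per word index
    hidden = embedding_dim * hidden_units + hidden_units    # weights + biases
    output = hidden_units * 1 + 1                           # weights + bias
    trainable = hidden + output + (embedding if trainable_embedding else 0)
    total = embedding + hidden + output
    return {"trainable": trainable, "non_trainable": total - trainable}

# vocab_size=20000 is an assumption for illustration only
frozen = count_parameters(vocab_size=20000, embedding_dim=50)
tuned = count_parameters(vocab_size=20000, embedding_dim=50, trainable_embedding=True)
print(frozen)
print(tuned)
```

With frozen embeddings only a few hundred weights are trained; with `trainable=True` almost all trainable weights sit in the embedding layer, which can make overfitting more likely.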

<a id="7"></a>
# CONVNET

In [None]:
embedding_dim = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

batch_size = 32
epochs = 5

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)
plot_history(hist)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))
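
The shapes reported by `model.summary()` for the convolutional layer can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (`padding='valid'` and a stride of 1); both helper functions are hypothetical names used for illustration.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Number of time steps after a 'valid' (unpadded) 1-d convolution."""
    return (input_length - kernel_size) // stride + 1

def conv1d_parameters(in_channels, filters, kernel_size):
    """Each filter spans kernel_size steps of all input channels, plus one bias."""
    return kernel_size * in_channels * filters + filters

# Values from the cell above: maxlen=100, embedding_dim=100, Conv1D(128, 5)
steps = conv1d_output_length(input_length=100, kernel_size=5)
params = conv1d_parameters(in_channels=100, filters=128, kernel_size=5)
print(steps, params)
```

Each of the 128 filters looks at windows of five consecutive word vectors, and the global max pooling that follows keeps the strongest response of each filter.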

<a id="8"></a>
# CuDNNLSTM

In [None]:
from keras.layers import LSTM, Dense, Bidirectional, Input,Dropout,BatchNormalization, CuDNNGRU, CuDNNLSTM

embedding_dim = 100
batch_size = 32
epochs = 1

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(CuDNNLSTM(128,return_sequences=True))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))

<a id="9"></a>
# CuDNNGRU

In [None]:
from keras.layers import LSTM, Dense, Bidirectional, Input,Dropout,BatchNormalization, CuDNNGRU, CuDNNLSTM

embedding_dim = 100
batch_size = 32
epochs = 1

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(CuDNNGRU(64, return_sequences=True))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
#submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
#submit_df.to_csv("submission.csv", index=False)

loss, accuracy = model.evaluate(X_tra,y_tra, verbose=False)
print("Training split Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_val,y_val, verbose=False)
print("Validation Accuracy:  {:.4f}".format(accuracy))


<a id="10"></a>
# Create your model

This model comes from the [keras starter](https://www.kaggle.com/christofhenkel/keras-starter) kernel. Here we define a custom model with the Keras functional API.

In [None]:
from keras.models import Model
from keras.layers import Input, Dense, Embedding, concatenate
from keras.layers import CuDNNGRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D

maxlen = 100
max_features = 50000

def get_model():
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, 100)(inp)
    x = CuDNNGRU(64, return_sequences=True)(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

model = get_model()
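
The pooling and concatenation steps in `get_model` can be followed shape by shape. The NumPy sketch below uses random numbers in place of the GRU output, so it only illustrates the tensor shapes and is not a replacement for the Keras layers.

```python
import numpy as np

# Stand-in for the GRU output with return_sequences=True: (batch, maxlen, units)
gru_output = np.random.default_rng(0).normal(size=(2, 100, 64))

avg_pool = gru_output.mean(axis=1)   # like GlobalAveragePooling1D: (batch, 64)
max_pool = gru_output.max(axis=1)    # like GlobalMaxPooling1D: (batch, 64)

# Concatenate along the feature axis: (batch, 128)
conc = np.concatenate([avg_pool, max_pool], axis=1)
print(conc.shape)
```

Combining both poolings gives the final Dense layer the average activation as well as the strongest activation of every GRU unit.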

In [None]:
batch_size = 32
epochs = 1

from sklearn.model_selection import train_test_split
X_tra, X_val, y_tra, y_val = train_test_split(x_train, y, test_size = 0.1, random_state=42)

hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                  verbose=True)

y_pred = model.predict(x_test, batch_size=1024)

y_te = (y_pred[:,0] > 0.5).astype(int)  # np.int was removed from recent NumPy releases
submit_df = pd.DataFrame({"qid": test_df["qid"], "prediction": y_te})
submit_df.to_csv("submission.csv", index=False)

<a id="11"></a>
# References 
*  https://realpython.com/python-keras-text-classification/