# Tutorial on how word-to-vector processing works

More word2vec info: 
    - https://www.tensorflow.org/tutorials/word2vec

In [9]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Load training data

In [2]:
train = pd.read_csv('../input/train.csv')
print('Train data shape: {}'.format(train.shape))
print(train['comment_text'][2])

Train data shape: (159571, 8)
Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.


Fill NA/NaN values with the placeholder string "_na_" so that every comment is a string before tokenization

In [3]:
list_sentences_train = train["comment_text"].fillna("_na_").values
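
As a small illustration of this step, the sketch below uses a made-up three-row Series (toy data, not the competition file) to show that fillna swaps the missing entry for the placeholder, leaving only strings.

```python
import pandas as pd

# Toy data invented for illustration; the second comment is missing.
toy_comments = pd.Series(["good point", float("nan"), "thanks"])

# Replace the missing entry with the same placeholder used above.
filled = toy_comments.fillna("_na_").values

print(list(filled))
print(all(isinstance(c, str) for c in filled))
```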

Define how many unique words to use (i.e., the number of rows in the embedding matrix)

In [4]:
max_features = 20000 

Create a Tokenizer object from the Keras built-in Tokenizer class. The class vectorizes texts and/or turns them into sequences (lists of word indexes, where the word of rank i in the dataset, starting at 1, has index i).

*num_words*: None or int. Maximum number of words to work with. If set, tokenization is restricted to the most common words in the dataset; in Keras only words with an index below num_words are kept, i.e. the top num_words - 1 words.

In [5]:
tokenizer = Tokenizer(num_words=max_features)

Fit the tokenizer to the text dataset. This counts word frequencies and builds the word index.

In [6]:
tokenizer.fit_on_texts(list(list_sentences_train))
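
To make the fitting step more concrete, here is a rough plain-Python sketch of how a frequency-ranked word index can be built. The helper build_word_index and the toy texts are hypothetical, and the regex tokenization is a simplification; the Keras Tokenizer applies its own filter list and is not reproduced exactly here.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization (an assumption): lowercase the text, then
        # keep runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() orders words by descending count.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


toy_texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(toy_texts)
print(word_index)
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value later on.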

texts_to_sequences(texts)

Arguments:
    - texts: list of texts to turn into sequences.
Returns:
    - list of sequences (one per input text), each a list of word indexes.

In [16]:
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
print(list_tokenized_train[0:2])
print('Size of list_tokenized_train: {}'.format(len(list_tokenized_train)))

[[688, 75, 1, 126, 130, 177, 29, 672, 4511, 12052, 1116, 86, 331, 51, 2278, 11448, 50, 6864, 15, 60, 2756, 148, 7, 2937, 34, 117, 1221, 15190, 2825, 4, 45, 59, 244, 1, 365, 31, 1, 38, 27, 143, 73, 3462, 89, 3085, 4583, 2273, 985], [52, 2635, 13, 555, 3809, 73, 4556, 2706, 21, 94, 38, 803, 2679, 992, 589, 8377, 182]]
Size of list_tokenized_train: 159571
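
The conversion itself can be sketched in a few lines of plain Python. The function texts_to_index_sequences, the toy word index and the toy texts are all hypothetical; the sketch assumes whitespace splitting and no out-of-vocabulary token, so unknown words and words whose index is not below num_words are simply dropped.

```python
def texts_to_index_sequences(texts, word_index, num_words=None):
    """Map each text to a list of word indexes, skipping unusable words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # Skip unknown words and words ranked outside the limit.
            if idx is None or (num_words is not None and idx >= num_words):
                continue
            seq.append(idx)
        sequences.append(seq)
    return sequences


# Toy index: rank 1 is the most frequent word.
toy_index = {"the": 1, "cat": 2, "sat": 3, "ran": 4, "dog": 5}
toy_texts = ["the cat sat", "the bird ran", "the dog"]

full = texts_to_index_sequences(toy_texts, toy_index)
limited = texts_to_index_sequences(toy_texts, toy_index, num_words=3)
print(full)     # "bird" is not in the index, so it is skipped
print(limited)  # only indexes 1 and 2 survive
```

Because words are dropped rather than replaced, the resulting sequences can be shorter than the original texts and differ in length from one another.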


Pad all sequences to equal length so they can be stacked into a single array of shape (number of comments, maxlen)
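
With its default settings, pad_sequences adds zeros at the front of short sequences and, for sequences longer than maxlen, keeps only the last maxlen entries. The plain-Python sketch below mimics that default behavior; pad_front and the toy sequences are hypothetical and stand in for the library call, which also supports padding and truncating at the end.

```python
def pad_front(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the tail of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # drop leading items if the sequence is too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_front(toy_sequences, maxlen=5))
```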

In [17]:
maxlen = 100 # max number of words in a comment to use
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(X_t[0:2])
print('Size of X_t: {}'.format(X_t.shape))

[[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0   688    75     1   126   130   177    29
    672  4511 12052  1116    86   331    51  2278 11448    50  6864    15
     60  2756   148     7  2937    34   117  1221 15190  2825     4    45
     59   244     1   365    31     1    38    27   143    73  3462    89
   3085  4583  2273   985]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0    