## <span style="color:#730101">Data Pre-Processing</span>

In this notebook, we fit a tokenizer on the training data and build the input and output arrays for both the training and test sets.

##### <span style="color:#3A3A3A">Import Libraries</span>

In [1]:
import numpy as np
import pandas as pd
import pickle
import os
import tensorflow as tf

from typing import Tuple, List
from sklearn.model_selection import train_test_split
from collections import Counter
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

pd.options.mode.chained_assignment = None

In [2]:
# Import Dataset
with open('Nevada.pkl', 'rb') as nevada:
    Nevada = pickle.load(nevada)

In [3]:
# Extract from the dataset the column that will be used as the input variable
X = Nevada[['input_text']]

In [4]:
# Extract from the dataset the column that will be used as the output variable
Y = Nevada[["review_stars"]]

##### <span style="color:#3A3A3A">Split into Train-Test</span>

First, we split the original dataset into two parts:
* Train-Validation dataset
* Test dataset

In [5]:
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
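
To make the split concrete, here is a minimal pure-Python sketch of a seeded holdout split. `holdout_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it will not reproduce the row assignment of `train_test_split`. It rounds the test share up, which is consistent with the 811035 / 202759 shapes reported later in this notebook. The second call shows one way the Train-Validation part could be divided further; the 25% share and the seed are arbitrary assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def holdout_split(rows: Sequence, test_size: float = 0.2, seed: int = 42) -> Tuple[List, List]:
    """Hypothetical helper: shuffle with a fixed seed, then hold out a fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Round the held-out share up so that small datasets still get a test row
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


reviews = [f"review {i}" for i in range(10)]

# 80/20 split into Train-Validation and Test
train_val, test = holdout_split(reviews)

# Optional second split of the Train-Validation part (assumed 75/25)
train, val = holdout_split(train_val, test_size=0.25, seed=0)

print(len(train), len(val), len(test))
```

Because the seed is fixed, calling `holdout_split(reviews)` again returns the same partition, which is the role `random_state=42` plays in the cell above.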

In [6]:
# Count how often each word occurs across all input texts
l = " ".join(X['input_text']).split(" ")
c = Counter(l)
# most_common() returns (word, count) pairs sorted by descending frequency
labels, values = zip(*c.most_common())
indexes = np.arange(len(labels))

# Number of distinct words in X
print(len(labels))

37149


In [7]:
# We will use 90% of the distinct words in the dataset as the maximum vocabulary size (num_words) of our tokenizer
0.90*len(labels)

33434.1
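
One way to sanity-check a vocabulary cap like this is to look at how much of the running text the kept words cover. The sketch below does that on a toy corpus; the three reviews and the `coverage` helper are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

# Toy corpus standing in for the 'input_text' column (assumed data)
reviews = ["good food good service", "bad service", "great food great place"]

counts = Counter(" ".join(reviews).split(" "))
vocab_size = len(counts)              # number of distinct words
max_words = round(0.90 * vocab_size)  # same 90% rule as in the cell above


def coverage(word_counts: Counter, top_k: int) -> float:
    """Share of all word occurrences accounted for by the top_k most frequent words."""
    total = sum(word_counts.values())
    kept = sum(count for _, count in word_counts.most_common(top_k))
    return kept / total


print(vocab_size, max_words, coverage(counts, max_words))
```

Note that `split(" ")` yields empty strings when a text contains consecutive spaces, so the count above assumes the input was already cleaned to single spaces.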

##### <span style="color:#3A3A3A">Keras Tokenizer</span>

We built our tokenizer following this [GitHub](https://github.com/keras-team/keras/issues/8092#issuecomment-372833486) thread.

In [8]:
# Keep only the most frequent words: 90% of the distinct words counted above
max_words_text = round(0.90*len(labels))

input_text = X_train.loc[:,'input_text']

# Set up the Keras tokenizer; words outside the vocabulary map to the '<OOV>' token
x_train_tokenizer = Tokenizer(num_words=max_words_text, filters=" ", lower=True, split=' ', oov_token='<OOV>')

# Build the word index from word frequencies in the training text
x_train_tokenizer.fit_on_texts(list(input_text))

# Trim word_index to the same cap, since fit_on_texts indexes every word it sees
x_train_tokenizer.word_index = {e:i for e,i in x_train_tokenizer.word_index.items() if i <= max_words_text}

print("Number of words mapped: {0}".format(len(x_train_tokenizer.word_index)))

# Turn each text into a sequence of integer word indices (not a one-hot encoding)
X_train.loc[:, 'text_seqs'] = x_train_tokenizer.texts_to_sequences(input_text)

Number of words mapped: 33434
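
The following pure-Python sketch illustrates the idea behind a frequency-ranked word index with an out-of-vocabulary token. It is a simplified stand-in, not the Keras code: `build_word_index` and `to_sequence` are hypothetical names, and the rule that indices greater than or equal to `num_words` fall back to `<OOV>` reflects our understanding of how `texts_to_sequences` behaves. Index 0 is left unused so that it can serve as the padding value later.

```python
from collections import Counter
from typing import Dict, List

OOV = "<OOV>"


def build_word_index(texts: List[str]) -> Dict[str, int]:
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(w for text in texts for w in text.lower().split(" ") if w)
    index = {OOV: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index


def to_sequence(text: str, word_index: Dict[str, int], num_words: int) -> List[int]:
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_id = word_index[OOV]
    sequence = []
    for word in text.lower().split(" "):
        if not word:
            continue
        idx = word_index.get(word)
        sequence.append(oov_id if idx is None or idx >= num_words else idx)
    return sequence


train_texts = ["good food", "good service", "good food here"]
word_index = build_word_index(train_texts)

# 'service' is known but too rare for num_words=4; 'pizza' was never seen
print(to_sequence("good service pizza", word_index, num_words=4))
```

The same mapping is reused unchanged for test texts, which is why words that appear only in the test set end up as `<OOV>`.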


In [9]:
# Print the vocabulary based on the tokenizer
x_train_tokenizer.word_index

{'<OOV>': 1,
 'food': 2,
 'good': 3,
 'not': 4,
 'place': 5,
 'great': 6,
 'service': 7,
 'american': 8,
 'order': 9,
 'time': 10,
 'restaurant': 11,
 'chicken': 12,
 'eat': 13,
 'best': 14,
 'no': 15,
 'delicious': 16,
 'menu': 17,
 'nice': 18,
 'table': 19,
 'wait': 20,
 'love': 21,
 'taste': 22,
 'friendly': 23,
 'mexican': 24,
 'staff': 25,
 'experience': 26,
 'japanese': 27,
 'amaze': 28,
 'pretty': 29,
 'sauce': 30,
 'server': 31,
 'meal': 32,
 'fresh': 33,
 'italian': 34,
 'people': 35,
 'dinner': 36,
 'night': 37,
 'price': 38,
 'bad': 39,
 'bar': 40,
 'salad': 41,
 'recommend': 42,
 'cheese': 43,
 'asian': 44,
 'rice': 45,
 'lunch': 46,
 'day': 47,
 'well': 48,
 'lot': 49,
 'favorite': 50,
 'flavor': 51,
 'small': 52,
 'seat': 53,
 'steak': 54,
 'meat': 55,
 'excellent': 56,
 'beef': 57,
 'worth': 58,
 'awesome': 59,
 'sit': 60,
 'breakfast': 61,
 'leave': 62,
 'super': 63,
 'big': 64,
 'chinese': 65,
 'pizza': 66,
 'bit': 67,
 'happy': 68,
 'location': 69,
 'fry': 70,
 'quali

In [10]:
# Length (in tokens) of every training sequence
text_lengths = list(X_train.text_seqs.apply(len))
# Use the 95th percentile of the lengths as the maximum sequence length,
# so that unusually long reviews (outliers) do not inflate the padded size
maxlentext = int(np.percentile(text_lengths, q=95))
# All sequences will be padded or truncated to this length
maxlentext

84
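
As a worked example of the percentile cutoff, the sketch below computes a 95th percentile by linear interpolation between the closest ranks, which we assume matches the default method of `np.percentile`. `percentile_linear` is a hypothetical helper and the five lengths are made-up values.

```python
import math
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Toy sequence lengths with one outlier (assumed data)
lengths = [3, 5, 8, 12, 40]

cutoff = int(percentile_linear(lengths, 95))
share_truncated = sum(n > cutoff for n in lengths) / len(lengths)

print(cutoff, share_truncated)
```

With a cutoff chosen this way, roughly the longest 5% of sequences in a large dataset get truncated and the rest are only padded.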

In [11]:
''' Padding Text Sequences '''

# All text sequences fed into the model must have the same length.
# Keras' pad_sequences does this: it cuts off sequences that are too long and pads sequences
# that are too short with zeros (appended at the end here, since padding='post').

text_sequences = X_train.loc[:, 'text_seqs']

text_data = (pad_sequences(text_sequences, padding='post', maxlen=maxlentext))

# We now have about 811K sequences, each 84 tokens long
print('New data shape: {}'.format(text_data.shape))

X_train['text_padded'] = text_data.tolist()

New data shape: (811035, 84)
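
The sketch below mimics this padding step in plain Python. One detail worth knowing is that `pad_sequences` truncates from the front by default (`truncating='pre'`), so with `padding='post'` a long review keeps its last `maxlen` tokens while a short one gets zeros appended. `pad_post_truncate_pre` is a hypothetical helper that imitates this combination; it is not the Keras function.

```python
from typing import List


def pad_post_truncate_pre(sequences: List[List[int]], maxlen: int, value: int = 0) -> List[List[int]]:
    """Keep the last maxlen tokens of long sequences; append `value` to short ones."""
    padded = []
    for seq in sequences:
        # Drop tokens from the front when the sequence is too long
        kept = seq[len(seq) - maxlen:] if len(seq) > maxlen else list(seq)
        # Append padding at the end when it is too short
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


example = pad_post_truncate_pre([[1, 2, 3, 4, 5], [6, 7]], maxlen=4)
print(example)
```

If the opening words of a review matter more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.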


In [12]:
''' Vectorized training data '''

# Convert the padded sequences and the star-rating labels to NumPy arrays

X_train = np.array(X_train.text_padded.tolist())
Y_train = np.array(Y_train.review_stars.tolist())

assert X_train.shape[0]==Y_train.shape[0]

In [13]:
''' Apply tokenizer on the test data '''

input_text = X_test.loc[:,'input_text']

X_test.loc[:, 'text_seqs'] = x_train_tokenizer.texts_to_sequences(input_text)

In [14]:
# Pad the test sequences to the same length as the training sequences (maxlentext),
# again cutting off sequences that are too long and appending zeros to those that are too short.

text_sequences = X_test.loc[:, 'text_seqs']

text_test_data = (pad_sequences(text_sequences, padding='post', maxlen=maxlentext))

# We now have about 202K test sequences, each 84 tokens long
print('New data shape: {}'.format(text_test_data.shape))

New data shape: (202759, 84)


In [15]:
X_test['text_padded'] = text_test_data.tolist()

In [16]:
''' Vectorized test data '''

# Convert the padded sequences and the star-rating labels to NumPy arrays

X_test = np.array(X_test.text_padded.tolist())
Y_test = np.array(Y_test.review_stars.tolist())

assert X_test.shape[0]==Y_test.shape[0]

In [17]:
# The training dataset now contains 811035 records.
X_train.shape

(811035, 84)

In [18]:
# The test dataset now contains 202759 records.
X_test.shape

(202759, 84)

In [19]:
# Save the datasets (np.save appends the .npy extension) and the tokenizer
np.save("X_train", X_train)
np.save("Y_train", Y_train)

np.save("X_test", X_test)
np.save("Y_test", Y_test)

with open('x_train_tokenizer.pkl', 'wb') as train:
    pickle.dump(x_train_tokenizer, train, protocol=pickle.HIGHEST_PROTOCOL)
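
For a later notebook that needs these artifacts, the round trip can be sketched as follows. The example writes to a temporary directory with made-up file names and a plain dictionary in place of the tokenizer, so it does not touch the files saved above. Loading the real pickled tokenizer additionally requires TensorFlow to be importable in the loading environment.

```python
import os
import pickle
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # np.save appends '.npy' when the given name has no such extension
    array_path = os.path.join(tmp, "X_demo")
    np.save(array_path, np.arange(6).reshape(2, 3))
    saved_files = sorted(os.listdir(tmp))
    restored = np.load(array_path + ".npy")

    # Pickle round trip; a dict stands in for the fitted tokenizer here
    tokenizer_path = os.path.join(tmp, "tokenizer_demo.pkl")
    with open(tokenizer_path, "wb") as handle:
        pickle.dump({"<OOV>": 1, "food": 2}, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(tokenizer_path, "rb") as handle:
        restored_index = pickle.load(handle)

print(saved_files, restored.shape, restored_index)
```

In this notebook's case the corresponding calls would be `np.load("X_train.npy")` and `pickle.load` on `x_train_tokenizer.pkl`.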