# Extracting Vectors from a CharCNN Model

### Reading in the Data

The data comes from the public Kaggle dataset *"Trip Advisor Hotel Reviews"*, available [here](https://www.kaggle.com/datasets/andrewmvd/trip-advisor-hotel-reviews).

In [1]:
# import necessary libraries and modules

import pandas as pd
from utils.preprocessing_utils import make_tokenizer, text_to_input_vectors
from utils.model_utils import make_model
from utils.vespa_utils import data_to_vespa_format
from tensorflow.keras.models import load_model
from tensorflow.keras import backend as K

In [2]:
# Read in the Kaggle hotel review dataset and rename its columns
df = pd.read_csv("./data/tripadvisor_hotel_reviews.csv")
df.columns = ['text', 'label']
df.head()

Unnamed: 0,text,label
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [3]:
# One-hot encode the label column (a rating from 1 to 5)
# The CharCNN model needs this to treat the ratings as 5 separate classes
# dtype=int keeps the indicator columns as 0/1 integers (newer pandas versions default to booleans)
df = pd.concat([df, pd.get_dummies(df['label'], prefix='rating', dtype=int)], axis=1)
df.head()

Unnamed: 0,text,label,rating_1,rating_2,rating_3,rating_4,rating_5
0,nice hotel expensive parking got good deal sta...,4,0,0,0,1,0
1,ok nothing special charge diamond member hilto...,2,0,1,0,0,0
2,nice rooms not 4* experience hotel monaco seat...,3,0,0,1,0,0
3,"unique, great stay, wonderful time hotel monac...",5,0,0,0,0,1
4,"great stay great stay, went seahawk game aweso...",5,0,0,0,0,1
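
As a small illustration of the encoding step above, the sketch below one-hot encodes a toy label column. The `toy` frame and the `reindex` call are assumptions added for illustration and are not part of the original notebook; `reindex` guards against a rating that never appears in the data, which would otherwise leave its column missing.

```python
import pandas as pd

# Toy labels covering all five ratings (hypothetical data, not from the dataset).
toy = pd.DataFrame({"label": [4, 2, 3, 5, 1]})

# dtype=int forces 0/1 integers; recent pandas versions default to booleans.
one_hot = pd.get_dummies(toy["label"], prefix="rating", dtype=int)

# reindex guarantees all five columns exist even if a rating is absent from the data.
expected_cols = [f"rating_{i}" for i in range(1, 6)]
one_hot = one_hot.reindex(columns=expected_cols, fill_value=0)

print(one_hot)
```

Each row contains exactly one `1`, in the column matching its rating.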


### Preprocessing the Data

In this step, the hotel review text is converted into arrays of character indices to be fed to our CharCNN model. The process follows [this Medium article](https://towardsdatascience.com/how-to-preprocess-character-level-text-with-keras-349065121089) by Xu Liang.

In [4]:
# Preprocess the text data to be fed into the CharCNN
# The tokenizer turns each text string into an array of 256 integers
# Each integer represents a character (a = 1, b = 2, etc.)
# If the text is shorter than 256 chars, pad with 0's at the end; otherwise truncate to 256
# (the sample output below starts mid-review, which suggests truncation drops the start of the text)
tk = make_tokenizer()
inputs = text_to_input_vectors(df, tk, 256, 'text')

print("Before preprocessing:\n\n", df.iloc[0]['text'])
print("\n")
print("After preprocessing:\n\n", inputs[0])

Before preprocessing:

 nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,  


After preprocessing:

 [21 19  9  3 18 15 15 13 14  9  7  8 20 13 15 18 14  9 14  7 12 15 21  4
  2  1 14  7 19  4 15 15 18 19 15 16  5 14  9 14  7  3 12 15 19  9 14  7
  8  5  1 18 16  5 15 16 12  5 20  1 12 11  9 14  7  8  1 12 12 23  1 25
 38 13  1 25  2  5 10 21 19 20 14 15  9 19 25 14  5  9  7  8  2 15 18 19
 38  1 22  5  4  1  2  1 20  8 16 18 15  4 21  3 20 19 14  9  3  
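
The helpers `make_tokenizer` and `text_to_input_vectors` are not shown in this notebook, so the following is only a minimal pure-Python sketch of the same idea. The alphabet, the function name `encode_text`, the choice to drop characters outside the alphabet, and the choice to keep the last `max_len` characters are all assumptions made for illustration.

```python
import string

# Hypothetical alphabet; the real vocabulary lives in utils.preprocessing_utils.
ALPHABET = string.ascii_lowercase + string.digits + ",;.!?:'"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved for padding


def encode_text(text, max_len=256):
    """Map characters to integer indices, then pad or truncate to max_len."""
    # Assumption: characters outside the alphabet (including spaces) are dropped.
    ids = [CHAR_TO_INDEX[ch] for ch in text.lower() if ch in CHAR_TO_INDEX]
    ids = ids[-max_len:]  # keep the last max_len indices when the text is too long
    return ids + [0] * (max_len - len(ids))  # pad with zeros at the end


vector = encode_text("nice hotel", max_len=12)
print(vector)
```

Every encoded review ends up with the same length, which is what lets the reviews be stacked into a single input array for the network.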

### Training the Model

In this step, we train our CharCNN model. The implementation follows [this Medium article](https://towardsdatascience.com/character-level-cnn-with-keras-50391c3adf33) by Xu Liang.

In [5]:
# Split half of data into training, half into testing set
# This CharCNN is only for demonstration purposes, so we'll use a trivial
# method for splitting the two sets by taking every other record in the data
train_data = inputs[::2]
test_data = inputs[1::2]

classes = [f"rating_{i}" for i in range(1,6)]

train_classes = df[classes].values[::2]
test_classes = df[classes].values[1::2]
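
An alternating split like the one above does not guarantee that both halves have a similar mix of ratings. The sketch below, using a hypothetical list of labels in place of `df['label']`, shows one quick way to compare the class counts of the two halves.

```python
from collections import Counter

# Hypothetical ratings standing in for df['label'].
labels = [4, 2, 3, 5, 5, 1, 4, 5, 2, 4]

train_labels = labels[::2]   # rows 0, 2, 4, ...
test_labels = labels[1::2]   # rows 1, 3, 5, ...

# The two slices together cover every row exactly once.
assert len(train_labels) + len(test_labels) == len(labels)

print("train:", Counter(train_labels))
print("test: ", Counter(test_labels))
```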

In [6]:
# Initialize and train model, then save it
model = make_model(train_data, train_classes, test_data, test_classes, tk)
model.save("./model")

# Or, load model if already saved
# model = load_model("./model")
# model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 256)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 256, 69)           4830      
_________________________________________________________________
conv1d (Conv1D)              (None, 250, 256)          123904    
_________________________________________________________________
activation (Activation)      (None, 250, 256)          0         
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 83, 256)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 77, 256)           459008    
_________________________________________________________________
activation_1 (Activation)    (None, 77, 256)           0     
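
The shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes a kernel size of 7, a pool size of 3, stride-1 convolutions with no padding, and 70 embedding rows (69 vocabulary entries plus index 0 for padding). These values are inferred from the printed shapes rather than read from `make_model`, which is not shown.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with no padding."""
    return length - kernel_size + 1


def pool_out_len(length, pool_size):
    """Output length of a max pool whose stride equals pool_size, no padding."""
    return length // pool_size


def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel * in * out) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Assumed hyperparameters, inferred from the summary rather than from the source code.
SEQ_LEN, EMBED_ROWS, EMBED_DIM, FILTERS, KERNEL, POOL = 256, 70, 69, 256, 7, 3

embedding_params = EMBED_ROWS * EMBED_DIM
len_conv1 = conv1d_out_len(SEQ_LEN, KERNEL)
len_pool1 = pool_out_len(len_conv1, POOL)
len_conv2 = conv1d_out_len(len_pool1, KERNEL)

print("embedding params:", embedding_params)
print("conv1d:  ", len_conv1, conv1d_params(KERNEL, EMBED_DIM, FILTERS))
print("pooling: ", len_pool1)
print("conv1d_1:", len_conv2, conv1d_params(KERNEL, FILTERS, FILTERS))
```

Under these assumptions the computed values agree with the `Output Shape` and `Param #` columns for the five layers involved.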

### Extracting and Feeding the Vectors

Here, we create a Keras function that extracts embedding vectors from our trained model, compute those vectors for a sample of our data, and then format the documents as JSON for feeding to Vespa.

In [7]:
# After training the model, create a Keras function that takes the model's normal
# input but returns the activations of the intermediate dense layer (the embedding
# vectors) INSTEAD of a probability distribution over the predicted classes
embedding_func = K.function(
        [model.get_layer('input').input],
        model.get_layer('dense').output
    )

# Then, generate 10 embedding vectors to be put into Vespa
embeddings = embedding_func(inputs[:10])

In [8]:
# Get first 10 documents and convert them into Vespa-feedable JSON format
data_to_vespa_format(df.iloc[:10], embeddings, path="./data/reviews.json")

10 successfully saved at ./data/reviews.json
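
The `data_to_vespa_format` helper is not shown here, so its exact output is unknown. As a rough sketch of what a Vespa feed file can look like, the code below builds `put` operations in Vespa's document JSON format. The namespace, document type, field names, and the function `to_vespa_put` are hypothetical and would need to match the schema of the actual Vespa application.

```python
import json


def to_vespa_put(doc_id, text, label, embedding,
                 namespace="reviews", doc_type="review"):
    """Build one Vespa 'put' operation (all names here are hypothetical)."""
    return {
        "put": f"id:{namespace}:{doc_type}::{doc_id}",
        "fields": {
            "text": text,
            "label": int(label),
            # Indexed tensor fields can be given as a flat list of cell values.
            "embedding": {"values": [float(v) for v in embedding]},
        },
    }


# Two made-up documents with short made-up vectors.
docs = [
    to_vespa_put(0, "nice hotel expensive parking", 4, [0.12, 0.0, 0.87]),
    to_vespa_put(1, "ok nothing special", 2, [0.05, 0.4, 0.0]),
]
feed = json.dumps(docs, indent=2)
print(feed)
```

Casting to `int` and `float` matters in practice because NumPy scalar types are not serializable by the standard `json` module.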
