<a href="https://colab.research.google.com/github/Shemka/ReviewsScoring/blob/master/reviews_scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset download

In [0]:
!pip install -q kaggle

In [3]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"sheminy","key":"<redacted>"}'}

In [0]:
!mkdir -p ~/.kaggle

!cp kaggle.json ~/.kaggle/

In [0]:
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
# For reviews sentiment analysis
!kaggle datasets download -d jiashenliu/515k-hotel-reviews-data-in-europe

Downloading 515k-hotel-reviews-data-in-europe.zip to /content
100% 48.0M/48.0M [00:00<00:00, 88.9MB/s]


In [7]:
!unzip 515k-hotel-reviews-data-in-europe.zip

Archive:  515k-hotel-reviews-data-in-europe.zip
  inflating: Hotel_Reviews.csv       


## Downloading pre-trained embeddings

In [8]:
!wget "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!gunzip GoogleNews-vectors-negative300.bin.gz

--2019-09-07 05:46:25--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.38
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2019-09-07 05:47:12 (33.8 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



## Libraries and main functions/classes

In [0]:
!pip install -q tensorflow==2.0.0rc0

In [3]:
# MAIN
import pandas as pd
import numpy as np
from tqdm import tqdm
from collections import Counter
import gc
import pickle
import joblib

# NLP
import spacy
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.models import KeyedVectors
import gensim
from gensim.models import Word2Vec

# SKLEARN
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, mean_squared_error, r2_score

# TENSORFLOW
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Embedding, Reshape, Flatten, Conv1D, SpatialDropout1D, MaxPooling1D, Dense, GRU, LSTM, Dropout, BatchNormalization, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
# Save the given model to the given path (including the file name)
def save_model(model, path='model.pkl'):
    # --- PARAMS ---
    # 'model' is the object to save (train or fit it before saving)
    # 'path' is the target file path (the extension should be '.pkl')

    joblib.dump(model, path, compress=1)
    print('Model was saved in', path)
    print('Use load_model to load model from path.')

class CleanUpText:
    def __init__(self):
        # NLP HELPERS
        self.stopwords = stopwords.words('english')
        self.nlp = spacy.load("en_core_web_sm")
        self.punctuations = string.punctuation

    # Return True if the string contains at least one digit
    def hasNumbers(self, inputString):
        return any(char.isdigit() for char in inputString)

    def fit(self, docs):
        pass
    
    def transform(self, docs):
        nlp = self.nlp
        stopwords = self.stopwords
        punctuations = self.punctuations

        texts = []
        cuts = {
            "'s": ' is',
            "'m": ' am',
            "'ll": ' will',
            "'re": ' are',
            "'d": ' would',
            "'ve": ' have'
        }
        for doc in docs:
            for key in cuts.keys():
                if key in doc:
                    doc = doc.replace(key, cuts[key])

            doc = nlp(doc, disable=['parser', 'ner', 'tagger'])
            
            tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
            tokens = [cuts[tok] if tok in cuts.keys() else tok for tok in tokens]
            tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations and not self.hasNumbers(tok)]
            texts.append(' '.join(tokens))
        return texts

    def fit_transform(self, docs, y=None):
        # Same cleaning as transform(), with a tqdm progress bar over the documents
        return self.transform(tqdm(docs))
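
The message printed by `save_model` refers to a `load_model` helper that is not defined anywhere in this notebook. A minimal sketch of such a counterpart, assuming the file was written with `joblib.dump` as above (the function name and default path are taken from that message and are otherwise an assumption):

```python
import os
import tempfile

import joblib


# Hypothetical counterpart to save_model; not part of the original notebook.
def load_model(path='model.pkl'):
    # joblib.load detects the compression used by joblib.dump on its own
    return joblib.load(path)


# Round-trip check with a plain Python object standing in for a fitted model
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
joblib.dump({'num_words': 50000}, demo_path, compress=1)
restored = load_model(demo_path)
print(restored)
```

The same call should work for the pickled tokenizer, provided the loading environment has a compatible TensorFlow version installed.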

In [0]:
# 'lr' is a deprecated alias; 'learning_rate' is the TF 2.x argument name
optimizer_adam = Adam(learning_rate=0.0005)

callbacks = [
    EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=2, verbose=1, restore_best_weights=True)
]
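
With the `EarlyStopping` settings above, training stops once `val_loss` has failed to improve by more than `min_delta` for `patience` consecutive epochs, and the best weights are then restored. A rough pure-Python sketch of that rule (the loss values are made up, and the function is an illustration of the idea, not Keras code):

```python
def early_stopping_epoch(val_losses, min_delta=0.0001, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of val_loss values."""
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stop
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: epoch 2 improves by less than min_delta
losses = [0.120, 0.095, 0.09495, 0.096, 0.090]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

Here the last value (0.090) is never reached, which is the usual trade-off of a small `patience`: training is cheaper, but a late improvement can be missed.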

## Data description and preprocessing

In [6]:
reviews = pd.read_csv('Hotel_Reviews.csv')
reviews['Positive_Review'] = reviews['Positive_Review'].apply(lambda x: '' if x == 'No Positive' else x + ' ')
reviews['Negative_Review'] = reviews['Negative_Review'].apply(lambda x: '' if x == 'No Negative' else x + ' ')
reviews['review'] = reviews['Positive_Review'] + ' ' + reviews['Negative_Review']
reviews = reviews[['Reviewer_Score', 'review']]
reviews.head(5)

| | Reviewer_Score | review |
|---|---|---|
| 0 | 2.9 | Only the park outside of the hotel was beauti... |
| 1 | 7.5 | No real complaints the hotel was great great ... |
| 2 | 7.1 | Location was good and staff were ok It is cut... |
| 3 | 3.8 | Great location in nice surroundings the bar a... |
| 4 | 6.7 | Amazing location and building Romantic settin... |


In [7]:
text = CleanUpText().fit_transform(reviews['review'])

100%|██████████| 515738/515738 [03:47<00:00, 2264.07it/s]
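
To make the cleaning step concrete, here is a simplified sketch of what `CleanUpText` does to a single review. It skips spaCy lemmatization entirely and uses a tiny hand-written stop-word set (`STOPWORDS` and `simple_clean` are assumptions for illustration, not the NLTK list or the notebook's class), so its output only approximates the real pipeline.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
STOPWORDS = {'the', 'is', 'was', 'a', 'and', 'it', 'i', 'we', 'in',
             'have', 'will', 'am', 'are', 'would'}

# Same contraction table as in CleanUpText
CUTS = {"'s": ' is', "'m": ' am', "'ll": ' will',
        "'re": ' are', "'d": ' would', "'ve": ' have'}


def simple_clean(doc):
    # Expand contractions the same way CleanUpText does
    for key, value in CUTS.items():
        doc = doc.replace(key, value)
    # Whitespace tokenization stands in for spaCy tokenization + lemmatization
    tokens = [tok.lower().strip(string.punctuation) for tok in doc.split()]
    # Drop empty tokens, stop words and anything containing a digit
    return ' '.join(
        tok for tok in tokens
        if tok and tok not in STOPWORDS and not any(c.isdigit() for c in tok)
    )


cleaned = simple_clean("The room's great and we'll return in 2020")
print(cleaned)
```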


## Data preparation for the model

In [8]:
# Data splitting and tokenizing 
X_train, X_val, y_train, y_val = train_test_split(text, reviews['Reviewer_Score']/10, random_state=73, test_size=0.2)
# Vocabulary size (maximum number of words kept by the tokenizer)
NUM_WORDS = 50000

tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_val = tokenizer.texts_to_sequences(X_val)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 61207 unique tokens.
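
Note that `word_index` holds every token seen during fitting (61207 here), while `texts_to_sequences` only emits indices below `num_words`. A simplified sketch of that behaviour in plain Python (the two helper functions are illustrative stand-ins, not the Keras implementation, and the ordering of equally frequent words may differ):

```python
from collections import Counter


def fit_word_index(texts):
    # Rank words by frequency; index 0 is left free for padding
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    # Keep only words whose index is below num_words; rarer words are dropped
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


corpus = ['great location great staff', 'great location noisy room']
toy_index = fit_word_index(corpus)
sequences = to_sequences(corpus, toy_index, num_words=3)
print(len(toy_index), sequences)
```

So with `num_words=3` only the two most frequent words survive, even though the index itself still lists all five.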


In [9]:
# Save tokenizer object
save_model(tokenizer, 'tokenizer.pkl')

Model was saved in tokenizer.pkl
Use load_model to load model from path.


In [10]:
%%time
# Pad sequences to a common length (validation is padded/truncated to the training length)
X_train = pad_sequences(X_train)
X_val = pad_sequences(X_val, maxlen=X_train.shape[1])
print('Shape of X train and X validation tensor:', X_train.shape, X_val.shape)

Shape of X train and X validation tensor: (412590, 371) (103148, 371)
CPU times: user 2.33 s, sys: 138 ms, total: 2.46 s
Wall time: 2.46 s
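
By default `pad_sequences` pads with zeros on the left and, when `maxlen` is given, truncates from the left as well. A small NumPy sketch of that default behaviour (`pad_pre` is a hypothetical helper written for illustration, not the Keras function):

```python
import numpy as np


def pad_pre(sequences, maxlen=None):
    # Without maxlen, use the length of the longest sequence
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]          # keep the last maxlen tokens
        if trimmed:
            out[row, maxlen - len(trimmed):] = trimmed  # zeros stay on the left
    return out


train = pad_pre([[5, 2], [7, 1, 4, 3]])
val = pad_pre([[9, 8, 7, 6, 5], [1]], maxlen=train.shape[1])
print(train)
print(val)
```

This is why the validation set is padded with `maxlen=X_train.shape[1]`: any validation review longer than the longest training review loses its first tokens.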


In [11]:
%%time
MAX_LEN = X_train.shape[1]
gwv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
EMBEDDING_DIM=300
vocabulary_size=min(len(word_index)+1,NUM_WORDS)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))

CPU times: user 29.4 s, sys: 3.11 s, total: 32.5 s
Wall time: 32.5 s


In [12]:
%%time
# Create embedding matrix
count_null = 0 
for word, i in word_index.items():
    if i>=NUM_WORDS:
        continue
    try:
        embedding_vector = gwv[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        count_null += 1
        embedding_matrix[i]=np.random.normal(0,np.sqrt(0.25),EMBEDDING_DIM)
print('Number of words not in the word2vec vocabulary:', count_null)
print('Out of %d words' % (vocabulary_size))
del gwv
gc.collect()

Number of words not in the word2vec vocabulary: 24155
Out of 50000 words
CPU times: user 1.05 s, sys: 37 ms, total: 1.08 s
Wall time: 1.08 s
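
The same matrix-building logic can be tried on a toy vocabulary without downloading word2vec. In the sketch below, `toy_word_index` and `toy_vectors` are made-up stand-ins for the fitted tokenizer index and the pre-trained vectors, and the sizes are shrunk for readability:

```python
import numpy as np

TOY_DIM = 4         # toy embedding size; the notebook uses 300
TOY_NUM_WORDS = 4   # toy vocabulary cap; the notebook uses 50000

# Hypothetical stand-ins for word_index and the word2vec vectors
toy_word_index = {'great': 1, 'location': 2, 'wifi': 3, 'minibar': 4}
toy_vectors = {'great': np.ones(TOY_DIM), 'location': np.full(TOY_DIM, 2.0)}

rng = np.random.default_rng(73)
toy_vocab_size = min(len(toy_word_index) + 1, TOY_NUM_WORDS)
toy_matrix = np.zeros((toy_vocab_size, TOY_DIM))
missing = 0
for word, i in toy_word_index.items():
    if i >= TOY_NUM_WORDS:
        continue  # beyond the cap, so the tokenizer never emits this index
    if word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]
    else:
        # No pre-trained vector: fall back to random noise with variance 0.25
        missing += 1
        toy_matrix[i] = rng.normal(0, np.sqrt(0.25), TOY_DIM)

print(toy_matrix.shape, missing)
```

Row 0 stays all zeros because index 0 is reserved for padding.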


## Model building and fitting

In [0]:
embedding_layer = Embedding(vocabulary_size,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_shape=(MAX_LEN,),
                            trainable=False)

In [14]:
model = Sequential()
model.add(embedding_layer)
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Conv1D(32, 11, activation='relu'))
model.add(MaxPooling1D(5))
model.add(SpatialDropout1D(0.2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_absolute_error', optimizer=optimizer_adam, metrics=['mae', 'mse'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 371, 300)          15000000  
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 371, 300)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 371, 64)           64128     
_________________________________________________________________
conv1d (Conv1D)              (None, 361, 32)           22560     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 72, 32)            0         
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 72, 32)            0         
_________________________________________________________________
flatten (Flatten)            (None, 2304)              0
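
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the TF 2.0 defaults: `padding='valid'` and stride 1 for `Conv1D`, a pooling stride equal to the pool size, and `reset_after=True` for `GRU` (which gives two bias vectors per gate).

```python
seq_len, emb_dim, vocab = 371, 300, 50000
gru_units, conv_filters, kernel, pool = 32, 32, 11, 5

embedding_params = vocab * emb_dim
# 3 gates, each with input weights, recurrent weights and two bias vectors
gru_params_one_direction = 3 * (gru_units * (emb_dim + gru_units) + 2 * gru_units)
bigru_params = 2 * gru_params_one_direction
# 'valid' convolution shortens the sequence by kernel - 1
conv_len = seq_len - kernel + 1
# The convolution sees 2 * gru_units channels (forward + backward GRU outputs)
conv_params = kernel * (2 * gru_units) * conv_filters + conv_filters
# Pooling uses floor division, so the trailing remainder is dropped
pool_len = conv_len // pool
flat_size = pool_len * conv_filters

print(embedding_params, bigru_params, conv_len, conv_params, pool_len, flat_size)
```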

In [19]:
# This is only an example of training the model; the real model uses a different loss
model.fit(X_train, np.asarray(y_train), batch_size=2048, epochs=1, verbose=1, 
          callbacks=callbacks, validation_data=(X_val, np.asarray(y_val)))

Train on 412590 samples, validate on 103148 samples


KeyboardInterrupt: ignored

The lowest MAE we reached is around 0.08 on the scaled targets (`Reviewer_Score/10`). This is not the best achievable result, but it is already good enough for our goals.
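
Since the targets were divided by 10 before training, an MAE of 0.08 translates to roughly 0.8 points on the original 10-point review scale. A small sketch of that conversion with made-up numbers (the arrays are illustrative, not model output):

```python
import numpy as np

# Made-up true scores and predictions on the scaled range
y_true = np.array([0.75, 0.29, 1.00, 0.67])
y_pred = np.array([0.80, 0.40, 0.90, 0.61])

mae_scaled = np.mean(np.abs(y_true - y_pred))
mae_points = mae_scaled * 10  # back to the original review-score scale
print(round(mae_scaled, 3), round(mae_points, 2))
```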