> # TripAdvisor Dataset

Customer happiness has become a top priority for service- and product-based companies. Some companies even appoint a CHO (Customer Happiness Officer) to make sure they deliver a delightful customer experience.

Machine learning now plays a pivotal role in delivering that experience. The ability to predict which customers are happy and which are unhappy gives companies a head start in improving it.


### Problem Statement
TripAdvisor is the world's largest travel site, where you can compare and book hotels, flights, restaurants, etc. The data set provided in this challenge consists of a sample of hotel reviews written by customers. Analysing customer reviews will help TripAdvisor understand the hotels listed on its website, i.e. whether they treat customers well and provide hospitality services as expected.

In this challenge, you have to predict if a customer is happy or not happy.

### Data Description
You are given three files to download: train.csv, test.csv and sample_submission.csv. The training data has 38932 rows and the test data has 29404 rows.

<b>
Variables	Description</b>
___ 
<i>User_ID</i>:	
unique ID of the customer

<i>Description</i>:	
description of the review posted

<i>Browser_Used</i>:	
browser used to post the review

<i>Device_Used</i>:	
device used to post the review

<i>Is_Response</i>:	
target variable ('happy' or 'not happy')
___

> #### Submission
A participant has to submit a CSV file containing User_ID and the predicted labels. Check the sample submission file for the format.

Scripts :

CatBoost, LightGBM, NaiveBayes (Python) - [Click Here](https://github.com/HackerEarth-Challenges/Happiness-ML-Challenge/blob/master/LGB_CB_Python.ipynb)


## 1. Load Data

In [None]:
!pip install nltk

In [27]:
# Import libs

import pandas as pd
import numpy as np
import nltk 
import tensorflow as tf
import keras 

import matplotlib.pyplot as plt

In [28]:
dataset = pd.read_csv('train.csv')

> #### View Data

In [29]:
dataset.head()

| | User_ID | Description | Browser_Used | Device_Used | Is_Response |
|---|---|---|---|---|---|
| 0 | id10326 | The room was kind of clean but had a VERY stro... | Edge | Mobile | not happy |
| 1 | id10327 | I stayed at the Crown Plaza April -- - April -... | Internet Explorer | Mobile | not happy |
| 2 | id10328 | I booked this hotel through Hotwire at the low... | Mozilla | Tablet | not happy |
| 3 | id10329 | Stayed here with husband and sons on the way t... | InternetExplorer | Desktop | happy |
| 4 | id10330 | My girlfriends and I stayed here to celebrate ... | Edge | Tablet | not happy |


In [30]:
dataset.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| User_ID | 38932 | 38932 | id13079 | 1 |
| Description | 38932 | 38932 | This is a quaint property walking distance fro... | 1 |
| Browser_Used | 38932 | 11 | Firefox | 7367 |
| Device_Used | 38932 | 3 | Desktop | 15026 |
| Is_Response | 38932 | 2 | happy | 26521 |


In [31]:
dataset['Is_Response'].unique()

array(['not happy', 'happy'], dtype=object)
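
The summary above shows that 26521 of the 38932 training reviews are labelled 'happy', so the two classes are not balanced. As a rough reference point, here is a minimal sketch of the accuracy a trivial model would reach by always predicting the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the describe() output above
n_total = 38932
n_happy = 26521
n_not_happy = n_total - n_happy

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(n_happy, n_not_happy) / n_total

print(n_not_happy)                  # 12411
print(round(baseline_accuracy, 4))  # 0.6812
```

Any trained model should beat this figure before its accuracy is considered meaningful.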

We are mainly concerned with the 'Description' and 'Is_Response' columns.

The steps to solve this sentiment analysis problem are:
1. Prepare data
2. Feature Extraction 
3. Build Model
4. Train Model
5. Checking Performance

# 1. Data Preparation:

In [32]:
from sklearn.preprocessing import LabelEncoder


def data_preparation(file_path):
    """Read a CSV file and return the review texts and their encoded labels."""
    data = pd.read_csv(file_path)

    # Encode the response as integers. LabelEncoder sorts the classes
    # alphabetically, so 'happy' -> 0 and 'not happy' -> 1.
    labelencoder = LabelEncoder()
    labels = labelencoder.fit_transform(data['Is_Response'])

    # Review texts written by the customers
    reviews = data['Description'].tolist()

    return reviews, np.asarray(labels)
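
The order in which `LabelEncoder` assigns integers matters later, when predicted class indices are mapped back to label names. A small sketch on made-up values shows the mapping; the variable names `encoder` and `encoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy responses, made up for illustration
responses = ['not happy', 'happy', 'happy']

encoder = LabelEncoder()
encoded = encoder.fit_transform(responses)

# Classes are sorted, so index 0 is 'happy' and index 1 is 'not happy'
print(list(encoder.classes_))  # ['happy', 'not happy']
print(encoded.tolist())        # [1, 0, 0]
```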
    

Before going further, it helps to see how the Keras `Tokenizer` works on a small example.

In [33]:
from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent! work']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# encode documents as a word-count matrix (one row per document, one column per word index)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 3), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'excellent': 8, 'good': 4, 'great': 5, 'done': 3, 'effort': 6, 'nice': 7, 'well': 2, 'work': 1}
{'excellent': 1, 'good': 1, 'great': 1, 'done': 1, 'effort': 1, 'nice': 1, 'well': 1, 'work': 3}
[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1.]]


# 2. Feature Extraction

For this problem, a bag-of-words model can be used to create the feature vectors.

The steps are:

1. Create a dictionary of word-index pairs for all distinct words in the training reviews.

2. Convert the words of each review into an array of word indices.

3. Turn each index array into a sparse binary vector and stack the vectors of all reviews into a matrix.
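
The three steps above can be sketched without Keras on two made-up reviews. This is only an illustration under simplifying assumptions: words are split on whitespace and indexed in order of first appearance, whereas the Keras `Tokenizer` also lowercases, strips punctuation, and orders words by frequency.

```python
import numpy as np

# Two toy reviews (made up for illustration)
toy_reviews = ["great room great view", "dirty room"]

# Step 1: dictionary of word -> index; index 0 is left unused,
# following the convention of the Keras Tokenizer
word_index = {}
for review in toy_reviews:
    for word in review.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

# Step 2: each review becomes an array of word indices
sequences = [[word_index[w] for w in review.split()] for review in toy_reviews]

# Step 3: each index array becomes a binary row vector, stacked into a matrix
bow = np.zeros((len(toy_reviews), len(word_index) + 1))
for row, seq in enumerate(sequences):
    bow[row, seq] = 1.0

print(word_index)  # {'great': 1, 'room': 2, 'view': 3, 'dirty': 4}
print(sequences)   # [[1, 2, 1, 3], [4, 2]]
print(bow)
```

Note that the repeated word "great" still yields a single 1 in the first row: a binary bag of words records presence, not counts.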

![image.png](attachment:image.png)

In [34]:
# Text tokenization utility class
import json
import numpy as np
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json

In [35]:
# preprocess training data

train_file_path = "./train.csv"
reviews, labels = data_preparation(train_file_path)



In [36]:

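# Create a dictionary of words and their indices

# Vocabulary size (an assumed value): only the most frequent words,
# those with an index below max_words, are used as features
max_words = 10000 
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(reviews)
# word_index maps every distinct word to an integer, as in the small example above
dictionary = tokenizer.word_index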


In [37]:
# save dictionary
with open('dictionary.json','w') as dictionary_file:
    json.dump(dictionary,dictionary_file)

In [38]:
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices

In [39]:
# Replace the words of each review with their indices
allWordIndices = [convert_text_to_index_array(text) for text in reviews]

In [40]:
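# Convert the index sequences into binary bag-of-words vectors (multi-hot encoding)

# sequences_to_matrix accepts the list of index lists directly; wrapping the
# ragged list in np.asarray fails on recent NumPy versions
train_X = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
# One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1]
labels = keras.utils.to_categorical(labels,num_classes=2)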
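
To make the label encoding concrete, here is a NumPy-only sketch of what one-hot encoding produces for a few toy labels. Indexing `np.eye` is used here as a stand-in for `keras.utils.to_categorical`, and the array names are illustrative.

```python
import numpy as np

# Toy encoded labels: 0 = happy, 1 = not happy
encoded = np.array([1, 0, 0])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2)[encoded]
print(one_hot)

# argmax along each row recovers the original class indices
recovered = np.argmax(one_hot, axis=1)
print(recovered.tolist())  # [1, 0, 0]
```

The same `argmax` step is what turns the two softmax outputs of the model back into a class index at prediction time.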



![image.png](attachment:image.png)


# 3. Building Model

We will use a fully connected (dense) neural network.

In [41]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout



# Creating a Dense Neural Network Model 
model = Sequential() 
model.add(Dense(256, input_shape=(max_words,), activation='elu')) 
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5)) 
model.add(Dense(2, activation='softmax'))
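
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output, and the dropout layers add none. The sketch below does that arithmetic; `layer_sizes` and `params` are illustrative names.

```python
max_words = 10000

# Input width followed by the widths of the three Dense layers
layer_sizes = [max_words, 256, 128, 2]

# weights (n_in * n_out) + biases (n_out) for each Dense layer
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(params)       # [2560256, 32896, 258]
print(sum(params))  # 2593410
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 10000-wide bag-of-words input.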


# 4. Training  Model

In [42]:

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy']) 

model.fit(train_X, labels, batch_size=32, epochs=5, verbose=1, validation_split=0.1, shuffle=True)



Train on 35038 samples, validate on 3894 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1ae811e6390>
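
The sample counts in the training log follow from `validation_split=0.1`: Keras holds out the last 10% of the rows (taken before shuffling) for validation and trains on the rest. A short sketch of that arithmetic, with illustrative variable names:

```python
import math

n_samples = 38932
validation_split = 0.1
batch_size = 32

# The training part is the first 90% of the rows, rounded down
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# Number of weight updates per epoch
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val)   # 35038 3894
print(steps_per_epoch)  # 1095
```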

In [None]:
# Save model to disk
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)
model.save_weights('model.h5')  

import json
import numpy as np
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.models import model_from_json
import pandas as pd

def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
    return wordIndices

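# Class names, indexed by the encoded label (0 = happy, 1 = not happy).
# The training data spells the second class 'not happy'; check
# sample_submission.csv for the spelling expected in the submission.
labels = ['happy','not_happy']
# Load the dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)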

# Load trained model
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
model = model_from_json(loaded_model_json)
model.load_weights('model.h5')

testset = pd.read_csv("./test.csv")    
# This tokenizer is only used to turn index lists into binary vectors,
# so it needs the same num_words as in training but no fitting
tokenizer = Tokenizer(num_words=10000)

# Predict happiness for each review in test.csv
y_pred = []   
for review in testset['Description']:
    testArr = convert_text_to_index_array(review)   
    # named test_vec to avoid shadowing the built-in input()
    test_vec = tokenizer.sequences_to_matrix([testArr], mode='binary')
    pred = model.predict(test_vec)
    
    # pick the class with the highest predicted probability
    y_pred.append(labels[np.argmax(pred)])


# Write the results in submission csv file
raw_data = {'User_ID': testset['User_ID'], 
        'Is_Response': y_pred}
df = pd.DataFrame(raw_data, columns = ['User_ID', 'Is_Response'])
df.to_csv('submission_model1.csv', sep=',',index=False)
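
The last step in the plan, checking performance, is not shown in the cells above. Since test.csv is used here only for prediction, one option is to compare predictions with known labels on a held-out part of the training data. The sketch below shows the calculation on hypothetical class indices; `y_true` and `y_hat` are made-up arrays, not results from this model.

```python
import numpy as np

# Hypothetical true and predicted class indices (0 = happy, 1 = not happy)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 1, 0, 0, 1])

# Fraction of reviews classified correctly
accuracy = float(np.mean(y_true == y_hat))

# Confusion matrix: rows are true classes, columns are predicted classes
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    confusion[t, p] += 1

print(round(accuracy, 3))  # 0.667
print(confusion)
```

With imbalanced classes, the confusion matrix is more informative than accuracy alone, because it shows how often each class is mistaken for the other.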