![Tripadvisor_lockup_horizontal_secondary_registered.svg](attachment:Tripadvisor_lockup_horizontal_secondary_registered.svg)
## Hotels play a crucial role in travel, and with increased access to information, new ways of selecting the best ones have emerged.

# Challenges

* [Topic Modelling on reviews](#topic)
* [Explore Key Aspects that make hotels good or bad](#eda)
* [Predict review rating](#model)

# Dataset

In [None]:
# Data Manipulation & Visualization
import os
import pandas as pd
import numpy as np
import seaborn as sns # statistical plotting built on matplotlib
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
import pickle as pk
from scipy import sparse as sp

# Text Manipulation
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from gensim.models import Phrases
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import gensim
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

# Machine Learning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix,mean_squared_error,mean_absolute_error,log_loss,accuracy_score,classification_report
from sklearn.metrics import precision_score
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [None]:
df = pd.read_csv('../input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')
df.head()

# Rating distribution in dataset

#### The rating distribution is left-skewed (negatively skewed): most reviews give 4 or 5 stars, with a long tail toward the low ratings

In [None]:
plt.figure(figsize=(8,7))
sns.countplot(data=df,x="Rating",edgecolor='black',linewidth=3)
plt.title('Rating distribution',size=17)
plt.show()
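
The skew can also be checked numerically. The sketch below uses a small hypothetical list of ratings (standing in for `df["Rating"]`, not the real data) and pandas' sample skewness; a negative value indicates that the tail points toward the low ratings.

```python
import pandas as pd

# Hypothetical toy ratings standing in for df["Rating"]; most mass sits at 4-5 stars.
ratings = pd.Series([5, 5, 5, 5, 4, 4, 4, 3, 2, 1], name="Rating")

# Share of each star rating, ordered from 1 to 5.
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)

# Negative sample skewness means the long tail points toward low ratings.
skewness = ratings.skew()
print(f"skewness: {skewness:.2f}")
```

On the real dataset, the same two calls can be applied directly to `df["Rating"]`.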

<a id='topic'></a>
# Topic modelling on reviews

### Pre-process and vectorize reviews

In [None]:
docs= np.array(df['Review'])

In [None]:
def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove tokens that are purely numeric, but keep words that contain digits.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove short words (three characters or fewer).
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

docs = docs_preprocessor(docs)


In [None]:
# Add bigrams and trigrams to docs. Bigrams must appear 10 times or more;
# the trigram model uses gensim's default min_count.
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a phrase found by the trigram model, add to document.
            docs[idx].append(token)

### Remove rare and common tokens
**After filtering out words that occur in fewer than 10 documents or in more than 20% of the documents, about 23% of the original vocabulary remains.**

In [None]:
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))
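
To make the filtering rule concrete, here is a simplified pure-Python sketch of the document-frequency test described above. The helper `filter_by_document_frequency` and the toy documents are hypothetical and exist only for illustration; the sketch also ignores the `keep_n` cap that gensim's `filter_extremes` applies by default.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Count each token once per document (document frequency, not raw count).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}

# Hypothetical toy corpus: "hotel" is in every document, "clean" in only one.
toy_docs = [
    ["hotel", "clean", "room"],
    ["hotel", "dirty", "room"],
    ["hotel", "beach", "room"],
    ["hotel", "beach", "pool"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Tokens that are too common ("hotel") carry little information for separating topics, and tokens that are too rare ("clean", "dirty", "pool" in this toy corpus) give the model too little evidence to place them reliably.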

In [None]:
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

## Training LDA...

LDA is an unsupervised technique, so we do not know before running the model how many topics exist in our corpus. Four topics is a reasonable starting point; we can then check whether this number separates the topics well.

We first train the model, then use the pyLDAvis tool to visualize the result:

In [None]:
# Set training parameters.
num_topics = 4
chunksize = 500 # number of documents processed in each training chunk
passes = 20 # number of passes through the whole corpus
iterations = 400 # maximum number of inference iterations per document
eval_every = 1  # Evaluate perplexity after every update (slow); set to None to skip evaluation.

# Make an index-to-word dictionary.
temp = dictionary[0] # accessing an item forces the lazily built id2token mapping to load
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every,random_state=12)


In [None]:
pyLDAvis.gensim.prepare(model, corpus, dictionary)

**What do we see here?**

The left panel, labeled Intertopic Distance Map, represents the topics and the distances between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle corresponds to the relative frequency of the topic in the corpus. An individual topic may be selected for closer scrutiny by clicking on its circle or by entering its number in the "selected topic" box in the upper left.

The right panel shows a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency measures both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between topics. Selecting a topic on the left changes the bar chart to show the most "relevant" terms for that topic. Relevance is defined in footnote 2 of the visualization and can be tuned with the parameter λ: a smaller λ gives more weight to a term's distinctiveness, while a larger λ ranks terms closer to their probability within the topic.

Therefore, to get a better sense of the terms that distinguish each topic, we'll use λ = 0.
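
The effect of λ can be illustrated with a small sketch of the relevance score used by pyLDAvis, relevance(w, t | λ) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The probabilities below are made-up toy values, not numbers from our model, and the `relevance` helper is written here only for illustration.

```python
import numpy as np

def relevance(p_term_given_topic, p_term, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    p_wt = np.asarray(p_term_given_topic, dtype=float)
    p_w = np.asarray(p_term, dtype=float)
    return lam * np.log(p_wt) + (1.0 - lam) * np.log(p_wt / p_w)

# Toy values (assumptions): "hotel" is frequent everywhere,
# "resort" is rarer overall but concentrated in this topic.
terms = np.array(["hotel", "beach", "resort"])
p_wt = np.array([0.50, 0.30, 0.20])  # p(term | topic)
p_w = np.array([0.40, 0.05, 0.02])   # p(term) in the whole corpus

for lam in (1.0, 0.0):
    order = np.argsort(-relevance(p_wt, p_w, lam))
    print(f"lambda={lam}: {terms[order].tolist()}")
```

With λ = 1 the ranking follows the within-topic probability, so the generic term comes first; with λ = 0 it follows the ratio p(w | t) / p(w), so the term most specific to the topic comes first.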

<a id='eda'></a>
# Explore key aspects that make hotels good or bad

### First, let's look at the most probable terms in each topic

In [None]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LDA model, a topic number, and the number of top terms (topn) of interest.
    Optionally print a formatted list of the topn terms with their topic probabilities,
    and return the terms.
    """
    terms = []
    # show_topic yields (term, probability) pairs, most probable first.
    for term, probability in lda_model.show_topic(topic_number, topn=topn):
        terms.append(term)
        if output:
            print(u'{:20} {:.3f}'.format(term, probability))
    
    return terms

In [None]:
topic_summaries = []
print(u'{:20} {}'.format(u'term', u'probability') + u'\n')
for i in range(num_topics):
    print('Topic '+str(i)+' |---------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=5, output=True)
    topic_summaries += [tmp[:5]]

**The four topics:**
* Topic 0: "Top Hotels/Comfortable Hotels"
**Includes terms such as minute walk, walking distance, comfortable, city, and staff friendly. This suggests comfortable city hotels close to the center, which seem to receive higher ratings.**

* Topic 1: "Resort Hotels"
**Includes terms such as beach, resort, punta cana, trip, beautiful, and ocean. This topic is clearly about resort hotels.**

* Topic 2: "Worst Hotels"
**Includes terms such as desk, problem, asked, told, check, dirty, and loud. This suggests lower-rated hotels, with complaints about reservations, dirtiness, and noise.**

* Topic 3: "Business Hotels"
**Includes terms such as coffee, continental breakfast, and sitting_area. This topic seems to refer to business-class hotels or something similar.**

## Let's see topic distribution by rating
**Draw your conclusions:**

In [None]:
# attach topics to df
all_topics = model.get_document_topics(corpus, minimum_probability=0.0)
all_topics_csr = gensim.matutils.corpus2csc(all_topics)
all_topics_numpy = all_topics_csr.T.toarray()
df['Topic'] = all_topics_numpy.argmax(axis=1)

# plot topics distribution by rating
plt.figure(figsize=(8,7))
sns.countplot(data=df,x="Rating",hue="Topic",edgecolor="black",linewidth=3)
plt.legend(['Top hotels',"Resort Hotels","Worst Hotels","Business Hotels"])
plt.title('Topics Distribution by rating',size=18)
plt.show()

# plot topics distribution 
plt.figure(figsize=(8,7))
ax=sns.countplot(data=df,x="Topic",edgecolor="black",linewidth=3)
ax.set_xticklabels(['Top hotels',"Resort Hotels","Worst Hotels","Business Hotels"])
plt.title('Topics Distribution',size=18)
plt.show()
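
Because 5-star reviews dominate the dataset, raw counts can hide how the topic mix changes across ratings. One option is a row-normalized cross-tabulation, sketched below on a small hypothetical frame that only mimics the `Rating` and `Topic` columns of `df`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df; not the real reviews.
toy = pd.DataFrame({
    "Rating": [5, 5, 5, 5, 4, 4, 2, 1, 1, 1],
    "Topic":  [0, 0, 1, 3, 0, 1, 2, 2, 2, 0],
})

# normalize="index" makes each row sum to 1:
# the share of each dominant topic within one rating level.
topic_share = pd.crosstab(toy["Rating"], toy["Topic"], normalize="index")
print(topic_share.round(2))
```

Applied to the real data, `pd.crosstab(df["Rating"], df["Topic"], normalize="index")` gives the share of each dominant topic per rating, which is easier to compare across ratings than the counts in the bar chart.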

<a id='model'></a>
# Predict review rating

**Suggested Metrics:**
* MAE
* RMSE
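
Ratings are ordinal, so predicting 4 for a true 5 is a smaller mistake than predicting 1. MAE and RMSE capture that distance, while accuracy does not. A minimal sketch of both metrics on made-up values (the arrays are illustrative, not model output):

```python
import numpy as np

# Made-up true and predicted ratings, for illustration only.
y_true = np.array([5, 4, 3, 1, 5])
y_hat = np.array([5, 5, 1, 2, 4])

errors = y_hat - y_true

# MAE: average absolute distance between predicted and true rating.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

RMSE is always at least as large as MAE, and the gap grows when a few predictions are far off.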

In [None]:
from sklearn.model_selection import train_test_split

X = df['Review']
y = df['Rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


In [None]:
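# CountVectorizer + TfidfTransformer together produce TF-IDF features
# (chaining TfidfVectorizer with TfidfTransformer would apply the IDF weighting twice).
xgb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', XGBClassifier(objective="multi:softmax",n_estimators=300,learning_rate=0.01))
              ])

# Recent XGBoost versions expect class labels 0..n_classes-1, so shift the 1-5 ratings to 0-4.
xgb.fit(X_train, y_train - 1)

# Shift predictions back to the 1-5 rating scale.
y_pred = xgb.predict(X_test) + 1

# scikit-learn metrics take (y_true, y_pred) in that order.
print("_"*25+"Classification Report"+"_"*25)
print(classification_report(y_test,y_pred,zero_division=0))
print("_"*25+"Evaluation Metrics"+"_"*25)
print("\n")
print("Accuracy: %f" % accuracy_score(y_test,y_pred))
print("Weighted Precision :%f" % precision_score(y_test,y_pred,average="weighted",zero_division=0))
print("MAE :%f" % mean_absolute_error(y_test,y_pred))
# Take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn.
print("RMSE :%f" % np.sqrt(mean_squared_error(y_test,y_pred)))


plt.figure(figsize=(8,7))
# Rows are actual ratings, columns are predicted ratings.
labels = [1, 2, 3, 4, 5]
cm=confusion_matrix(y_test,y_pred,labels=labels)
g=sns.heatmap(cm,annot=True,fmt='d',linewidths=1,linecolor='black',
                  annot_kws={"size":14},cmap='Blues',cbar=False,
                  xticklabels=labels,yticklabels=labels)

plt.xlabel('Predicted',size=16)
plt.ylabel('Actual',size=16)
plt.title('Confusion Matrix \n XGB Classifier',size=16)
plt.show()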