# Trip Advisor Reviews

<h3>Context</h3>
This data set was scraped from Trip Advisor for restaurants in Mumbai, India. Scraping took one day and produced 2996 rows and 13 columns. It does not cover every restaurant in Mumbai, only those rated 3.5 or higher. Some restaurant names appear more than once with different addresses; these are separate restaurants that belong to the same chain, not duplicate records.

<h3>Acknowledgements</h3>
Thanks to Trip Advisor for having the dataset available to scrape.

<h3>Inspiration</h3>
This is a simple data set that you can use to practice cleaning data, checking for correlations, finding the top-rated restaurants, and much more. Feel free to explore it, and please let me know if you like data sets like this one, as it will encourage me to scrape more of them.

In [1]:
import pandas as pd
import pickle
import nltk

tp = pd.read_csv('Trip_advisor_review.csv')
tp.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [33]:
tp.shape

(20491, 2)

In [34]:
tp.isnull().sum()

Review    0
Rating    0
dtype: int64

In [35]:
tp.Rating.value_counts()

5    9054
4    6039
3    2184
2    1793
1    1421
Name: Rating, dtype: int64

In [36]:
# Look at a sample of the reviews that were given a Rating of 3

tp.Review[tp.Rating == 3].head(10)

2     nice rooms not 4* experience hotel monaco seat...
13    nice hotel not nice staff hotel lovely staff q...
19    hmmmmm say really high hopes hotel monaco chos...
25    n't mind noise place great, read reviews noise...
27    met expectations centrally located hotel block...
46    pay read reviews booked knew getting, mind n't...
47    not bad location unmatchable price range, simp...
54    expensive, not biz travellers, simple fact hot...
56    okay not amazing husband stayed weekend night,...
67    ace not place husband stayed ace hotel seattle...
Name: Review, dtype: object

In [37]:
tp.loc[tp.Rating <= 3, 'Rating'] = 0 # 'Bad Rating'
tp.loc[tp.Rating >= 4, 'Rating'] = 1 # 'Good Rating'
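
The two assignments above are only correct in this order: if ratings of 4 and 5 were set to 1 first, those 1s would then match `Rating <= 3` and be overwritten with 0. A minimal sketch of an order-independent mapping, written in plain Python so it runs on its own; the helper name `to_binary_label` and the sample list are illustrative and not part of the notebook:

```python
def to_binary_label(rating: int) -> int:
    """Map a 1-5 star rating to 1 (good, 4-5 stars) or 0 (bad, 1-3 stars)."""
    return 1 if rating >= 4 else 0

# Hypothetical sample of star ratings, one per review.
sample_ratings = [4, 2, 3, 5, 5, 1]
sample_labels = [to_binary_label(r) for r in sample_ratings]
print(sample_labels)
```

With pandas, the same single-pass mapping could be written as `(tp.Rating >= 4).astype(int)`, computed from the original star ratings.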

In [38]:
# After binarizing the ratings into good (1) and bad (0), the classes are imbalanced

tp.Rating.value_counts()

1    15093
0     5398
Name: Rating, dtype: int64
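
To put a number on the imbalance, the counts above can be turned into class shares. The sketch below also computes the accuracy of a trivial model that always predicts the majority class, which is a useful floor when reading the accuracy figures later on. The variable names are illustrative:

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 15093, 0: 5398}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```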

In [2]:
from nltk.corpus import stopwords
import string
from nltk.stem.snowball import SnowballStemmer # stemming

stemmer = SnowballStemmer(language='english')
nltk.download('stopwords')   # fetch the NLTK stopword corpus if it is missing
stopwords.words('english')   # last expression in the cell: displays the English stopword list

In [56]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words("english"))

def text_process(text):
    """
    1. remove the punctuation
    2. remove the stopwords
    3. stem the remaining words
    4. return the list of stemmed words
    """
    # drop punctuation characters and rejoin into a single string
    nopunc = "".join(char for char in text if char not in string.punctuation)

    # keep only the words that are not English stopwords
    st_word = [word for word in nopunc.split() if word not in STOP_WORDS]

    # reduce each word to its stem (no per-word print, which would flood the output)
    return [stemmer.stem(word) for word in st_word]
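
The cleaning steps can be tried on a single sentence without NLTK. This is only a sketch: `DEMO_STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English list, `demo_text_process` is a hypothetical name, and stemming is left out.

```python
import string

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "was", "and", "a"}

def demo_text_process(text: str) -> list:
    """Strip punctuation and drop stopwords (stemming omitted in this sketch)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in DEMO_STOP_WORDS]

demo_tokens = demo_text_process("the room was clean, and the staff was great!")
print(demo_tokens)
```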


In [3]:
# Bag-of-words representation: CountVectorizer converts the reviews into a sparse matrix of word counts

from sklearn.feature_extraction.text import CountVectorizer

bow_transformer = CountVectorizer(analyzer = text_process).fit(tp["Review"])

In [4]:
review_bow = bow_transformer.transform(tp['Review'])
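
For intuition, the counting that a bag-of-words model performs can be sketched with the standard library. This is not how `CountVectorizer` is implemented (it stores the result as a sparse matrix, while this toy version is dense), and the tokenized `docs` are made up for illustration:

```python
from collections import Counter

# Hypothetical tokenized reviews, as a text-cleaning function might return them.
docs = [["nice", "hotel", "nice", "staff"], ["bad", "hotel"]]

# Vocabulary: every distinct token, sorted so each one gets a stable column.
vocabulary = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary entry; cells hold counts.
count_matrix = []
for doc in docs:
    counts = Counter(doc)
    count_matrix.append([counts.get(token, 0) for token in vocabulary])

print(vocabulary)
print(count_matrix)
```

Each distinct token becomes one column, which is why the real feature matrix in this notebook ends up with tens of thousands of columns.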

In [59]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(review_bow, tp.Rating)

In [60]:
print(x_train.shape)
print(y_train.shape)

(15368, 67355)
(15368,)


In [61]:
print(x_test.shape)
print(y_test.shape)

(5123, 67355)
(5123,)
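
`train_test_split` is called here without `test_size`, so scikit-learn's default of 0.25 applies, and without `random_state`, so the split and the scores that follow will differ from run to run. The row counts can be checked by hand; this sketch assumes the test size is rounded up, which agrees with the shapes printed above:

```python
import math

n_samples = 20491      # rows in tp, from tp.shape
test_fraction = 0.25   # scikit-learn default when test_size is not given

# Assumption: the test set size is rounded up and the rest goes to training.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Given the class imbalance, passing `stratify=tp.Rating` and a fixed `random_state` to `train_test_split` would be one way to keep the class proportions equal in both parts and make the results reproducible.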


In [62]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()

In [63]:
nb_model.fit(x_train, y_train)

MultinomialNB()

In [64]:
pred_nb = nb_model.predict(x_test)

In [71]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

tab1 = confusion_matrix(y_test, pred_nb)
print(tab1)

cl = classification_report(y_test, pred_nb)
print(cl)

[[ 905  481]
 [ 166 3571]]
              precision    recall  f1-score   support

           0       0.85      0.65      0.74      1386
           1       0.88      0.96      0.92      3737

    accuracy                           0.87      5123
   macro avg       0.86      0.80      0.83      5123
weighted avg       0.87      0.87      0.87      5123
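
The figures in the report follow directly from the confusion matrix, where rows are the true labels and columns are the predictions. A small sketch that recomputes them in plain Python; the helper `per_class_metrics` is illustrative, not a scikit-learn function:

```python
# Confusion matrix from the Naive Bayes output above (rows = true, columns = predicted).
cm = [[905, 481],
      [166, 3571]]

def per_class_metrics(cm, label):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_pos = cm[label][label]
    predicted = sum(row[label] for row in cm)  # column sum: predicted as this class
    actual = sum(cm[label])                    # row sum: truly in this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
nb_metrics = {label: per_class_metrics(cm, label) for label in (0, 1)}

for label, (p, r, f) in nb_metrics.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The recall of 0.65 for class 0 means that about a third of the bad reviews (481 of 1386) are predicted as good.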



# Random Forest Classifier

In [75]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(class_weight='balanced')

rf.fit(x_train, y_train)

RandomForestClassifier(class_weight='balanced')

In [76]:
rf_pred = rf.predict(x_test)

In [77]:
tab_rf = confusion_matrix(y_test, rf_pred)
print(tab_rf)

cl_rf = classification_report(y_test, rf_pred)
print(cl_rf)

[[ 504  882]
 [  15 3722]]
              precision    recall  f1-score   support

           0       0.97      0.36      0.53      1386
           1       0.81      1.00      0.89      3737

    accuracy                           0.82      5123
   macro avg       0.89      0.68      0.71      5123
weighted avg       0.85      0.82      0.79      5123
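
Even with `class_weight='balanced'`, the random forest recovers only 504 of the 1386 bad reviews. Because the classes are imbalanced, plain accuracy hides part of that gap, so one option is to compare the two models on balanced accuracy, the mean of the per-class recalls. A sketch using the two confusion matrices printed in this notebook; the function and dictionary names are illustrative:

```python
# Confusion matrices copied from the two outputs above (rows = true class).
matrices = {
    "naive_bayes": [[905, 481], [166, 3571]],
    "random_forest": [[504, 882], [15, 3722]],
}

def balanced_accuracy(cm):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)

scores = {name: balanced_accuracy(cm) for name, cm in matrices.items()}
for name, score in scores.items():
    print(f"{name}: balanced accuracy = {score:.2f}")
```

These values match the macro-average recall rows in the two classification reports, and on this particular split they favor Naive Bayes.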

