# Obtain Data

The main data source was scraped from TripAdvisor, a popular travel review website, using Scrapy. I chose to scrape all hotel and resort reviews for Punta Cana, a Caribbean vacation destination that is rising in popularity.

The Scrapy spider crawled the site and wrote all the data to a JSON file, although the framework also supports item pipelines that could write to a MongoDB database.
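
As a rough illustration of the item-pipeline route, a Scrapy item pipeline is an ordinary Python class with a `process_item(self, item, spider)` method, enabled through the `ITEM_PIPELINES` setting. The sketch below writes each item as one JSON line. The class name `JsonLinesPipeline` and the default file name are hypothetical and are not taken from this project's spider; a MongoDB pipeline would keep the same shape and swap the file write for a database insert.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every scraped item to a JSON Lines file."""

    def __init__(self, path="reviews.jl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; returns the item so later pipelines can use it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```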

Please see /src/tripdadvisor_reviews for the Scrapy source code.

# Scrub Data

This is an NLP project, so scrubbing the data follows a different path than in a typical supervised learning project.

After importing the data, I take these steps:

1. Clean the data
2. Tokenize the data
3. Vectorize the data
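
The three steps can be sketched end to end in plain Python. This is a minimal illustration under simplifying assumptions, not the notebook's actual pipeline: the helper names `clean`, `tokenize`, and `vectorize` are hypothetical, whitespace splitting stands in for a real tokenizer, and the two sample sentences are made up.

```python
import re
from collections import Counter


def clean(text):
    # Lowercase, then replace anything that is not a letter or whitespace.
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    # Whitespace split is a stand-in for a real tokenizer such as NLTK's.
    return text.split()


def vectorize(docs):
    # Build a shared, sorted vocabulary, then one count vector per document.
    counts = [Counter(tokenize(clean(d))) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]


vocab, matrix = vectorize(["Great beach, great staff!", "The beach was busy."])
print(vocab)   # ['beach', 'busy', 'great', 'staff', 'the', 'was']
print(matrix)  # [[1, 0, 2, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
```

Each row of `matrix` is one document and each column is one vocabulary word, which is the same document-term layout that `CountVectorizer` produces as a sparse matrix.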

In [308]:
# Imports
import pandas as pd
import numpy as np
import json
from nltk import word_tokenize
from nltk.util import ngrams
import spacy

import en_core_web_md
spacy_nlp = en_core_web_md.load()

In [309]:
# Decorator Functions
from functools import wraps

def my_logger(orig_func):
    import logging
    logging.basicConfig(filename='{}.log'.format(orig_func.__name__), level=logging.INFO)

    @wraps(orig_func)
    def wrapper(*args, **kwargs):
        logging.info(
            'Ran with args: {}, and kwargs: {}'.format(args, kwargs))
        return orig_func(*args, **kwargs)

    return wrapper


def my_timer(orig_func):
    import time

    @wraps(orig_func)
    def wrapper(*args, **kwargs):
        t1 = time.time()
        result = orig_func(*args, **kwargs)
        t2 = time.time() - t1
        print('{} ran in: {} sec'.format(orig_func.__name__, t2))
        return result

    return wrapper

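def remove_list_and_rearrange(df):
    # Fields scraped as lists are reduced to their first element ('' if the list is empty).
    df = df.applymap(lambda x: x if not isinstance(x, list) else x[0] if len(x) else '')
    # Keep only the columns used downstream, in a fixed order.
    return df[["hotel","title","content","stars"]]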

def tokenize_and_grams(df, list_of_n):
    # Lowercase each review and split it into NLTK word tokens.
    df['tokens'] = df['content'].apply(lambda text: word_tokenize(text.lower()))
    tokens = df['tokens'].to_list()
    grams = {}
    try:
        for x in list_of_n:
            # Add one column per n (e.g. '2gram') holding that review's n-gram tuples.
            df[f'{x}gram'] = df['tokens'].apply(lambda row: list(ngrams(row, x)))
            grams[f'{x}gram'] = df[f'{x}gram'].to_list()
    except TypeError:
        print("Please input a list of numbers.")

    return tokens, grams

def spacy_tokenize():
    # Stub: the spaCy tokenizer is not implemented yet; the body keeps the cell parseable.
    pass

In [250]:
reviews = pd.read_json("../data/raw/all.json")

In [251]:
reviews = remove_list_and_rearrange(reviews)

## Tokenize and N-grams

First, we tokenize each review with NLTK and build its 2-grams and 3-grams.
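
To make the n-gram step concrete, here is a small pure-Python sketch that produces the same kind of tuples as `nltk.util.ngrams` does without padding. The helper name `simple_ngrams` and the sample sentence are illustrative assumptions, and whitespace splitting stands in for `word_tokenize`.

```python
def simple_ngrams(tokens, n):
    # Slide a window of length n over the tokens by zipping n shifted copies.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "hands down great family vacation".split()
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
# [('hands', 'down'), ('down', 'great'), ('great', 'family'), ('family', 'vacation')]
print(trigrams)
# [('hands', 'down', 'great'), ('down', 'great', 'family'), ('great', 'family', 'vacation')]
```

A sequence of k tokens yields k - n + 1 n-grams, so very short reviews may produce an empty list.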

In [252]:
tokens, grams = tokenize_and_grams(reviews,[2,3])

In [253]:
reviews.head()

index,hotel,title,content,stars,tokens,2gram,3gram
0,The Reserve at Paradisus Punta Cana,Family concierge stay,Hands down great family vacation. Estefanía ou...,5,"[Hands, down, great, family, vacation, ., Este...","[(Hands, down), (down, great), (great, family)...","[(Hands, down, great), (down, great, family), ..."
1,Paradisus Punta Cana Resort,Sunrise bartender,Juan Batistae has got to be the best host/bart...,5,"[Juan, Batistae, has, got, to, be, the, best, ...","[(Juan, Batistae), (Batistae, has), (has, got)...","[(Juan, Batistae, has), (Batistae, has, got), ..."
2,Dreams Palm Beach Punta Cana,Overall Excellent Experience!,This was our first trip! We went with our frie...,5,"[This, was, our, first, trip, !, We, went, wit...","[(This, was), (was, our), (our, first), (first...","[(This, was, our), (was, our, first), (our, fi..."
3,Now Onyx Punta Cana,Lovely 10 days,Me and my partner travelled from Glasgow Scotl...,5,"[Me, and, my, partner, travelled, from, Glasgo...","[(Me, and), (and, my), (my, partner), (partner...","[(Me, and, my), (and, my, partner), (my, partn..."
4,Paradisus Punta Cana Resort,Amazing spring break vacation✌️👌👨‍👩‍👧‍👦😍😍👍,We had amazing time with family and friends. A...,4,"[We, had, amazing, time, with, family, and, fr...","[(We, had), (had, amazing), (amazing, time), (...","[(We, had, amazing), (had, amazing, time), (am..."


## Clean the text

The raw review text needs some preprocessing.

# Explore Data

In [153]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## CountVectorizer

In [264]:
reviews_label = [x+": "+y for x,y in zip(reviews.hotel.to_list(),reviews.title.to_list())]

In [271]:
count_vectorizer = CountVectorizer(ngram_range=(1, 3),
                                   stop_words='english', token_pattern="\\b[a-z][a-z]+\\b")
doc_word = count_vectorizer.fit_transform(reviews.content.to_list())
# Densify only the first 10 rows; get_feature_names_out() is required on scikit-learn >= 1.2.
pd.DataFrame(doc_word[:10].toarray(), index=reviews_label[:10], columns=count_vectorizer.get_feature_names_out())

review,aa,aa batteries,aa batteries diamond,aaa,aaa diamond,aaa diamond rating,aaa simon,aaa simon exhale,aaa travel,aaa travel agent,...,zumba workouts definitely,zumba yoga,zumba yoga carlos,zumbar,zumbar discovery,zumbar discovery kid,zumbar service,zumbar service jansel,zumbar wilson,zumbar wilson jenny
The Reserve at Paradisus Punta Cana: Family concierge stay,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Paradisus Punta Cana Resort: Sunrise bartender,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Dreams Palm Beach Punta Cana: Overall Excellent Experience!,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Now Onyx Punta Cana: Lovely 10 days,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Paradisus Punta Cana Resort: Amazing spring break vacation✌️👌👨‍👩‍👧‍👦😍😍👍,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Dreams Palm Beach Punta Cana: Great vacation hotel,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Now Onyx Punta Cana: Paradise personified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Grand Palladium Punta Cana Resort & Spa: Excellent stay,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Grand Palladium Punta Cana Resort & Spa: Family Vacation 2019,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Four Points by Sheraton Puntacana Village: Pit stop,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
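
The effect of the `token_pattern` used above can be checked in isolation with the standard `re` module. This sketch only reproduces the tokenizing step: `CountVectorizer` lowercases text by default before applying the pattern, and it removes stop words and builds n-grams afterwards, which is not shown here. The sample sentence is made up.

```python
import re

# Same pattern as the one passed to CountVectorizer above.
TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")

sample = "We stayed 10 days at Now Onyx! A+ service."  # made-up review text
# Lowercase first, mirroring CountVectorizer's default behaviour.
tokens = TOKEN_PATTERN.findall(sample.lower())
print(tokens)  # ['we', 'stayed', 'days', 'at', 'now', 'onyx', 'service']
```

Numbers and single-letter tokens are dropped, which is why the vocabulary starts at two-letter strings such as `aa`.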


## TF-IDF

## Reduce Dimensionality

# Modeling Data

In [286]:
import gensim
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
from gensim import corpora, models, similarities, matutils

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [310]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        
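# Integer keys, so they match the integer model_num passed to the function below.
dim_red_choices = {
    1: NMF,
    2: TruncatedSVD,
    3: models.LdaModel
}

def dim_reduction_modeling(model_num, num_topics, doc_word):
    if model_num in (1, 2):
        # scikit-learn models are fit directly on the document-term matrix.
        model = dim_red_choices[model_num](n_components=num_topics)
        model.fit(doc_word)
    else:
        # gensim expects a streamed corpus with documents as columns, hence the transpose.
        corpus = matutils.Sparse2Corpus(doc_word.transpose())
        model = dim_red_choices[model_num](corpus=corpus, num_topics=num_topics)

    return model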
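
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` can be hard to read, so here is a pure-Python stand-in that follows the same logic: sort the indices by ascending weight, then walk backwards from the end for `n` steps. The helper name `top_indices` and the toy vocabulary and weights are assumptions for illustration, and the equivalence with NumPy's `argsort` is only claimed for weights without ties.

```python
def top_indices(weights, n):
    # Indices sorted by ascending weight, like argsort() on a NumPy array.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the last (largest) index and take n of them.
    return order[:-n - 1:-1]


feature_names = ["beach", "food", "pool", "room", "staff"]  # toy vocabulary
topic = [0.7, 0.1, 0.4, 0.0, 0.2]                            # toy topic weights
top_words = [feature_names[i] for i in top_indices(topic, 3)]
print(", ".join(top_words))  # beach, pool, staff
```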

## NMF

In [294]:
nmf_model = NMF(7)
doc_topic = nmf_model.fit_transform(doc_word)

In [295]:
topic_word = pd.DataFrame(nmf_model.components_.round(3),
             index = ["component_1","component_2","component_3"
                     ,"component_4","component_5","component_6"
                      ,"component_7"],
             columns = count_vectorizer.get_feature_names_out())
topic_word

component,aa,aa batteries,aa batteries diamond,aaa,aaa diamond,aaa diamond rating,aaa simon,aaa simon exhale,aaa travel,aaa travel agent,...,zumba workouts definitely,zumba yoga,zumba yoga carlos,zumbar,zumbar discovery,zumbar discovery kid,zumbar service,zumbar service jansel,zumbar wilson,zumbar wilson jenny
component_1,0.0,0.0,0.0,0.001,0.0,0.0,0.0,0.0,0.003,0.004,...,0.002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
component_2,0.001,0.001,0.001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.001,0.001,0.001,0.0,0.0,0.0,0.0
component_3,0.003,0.003,0.003,0.005,0.001,0.001,0.001,0.001,0.002,0.0,...,0.0,0.0,0.0,0.004,0.002,0.002,0.001,0.001,0.001,0.001
component_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
component_5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
component_6,0.0,0.0,0.0,0.001,0.0,0.0,0.0,0.0,0.002,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
component_7,0.0,0.0,0.0,0.002,0.001,0.001,0.0,0.0,0.0,0.0,...,0.0,0.004,0.004,0.006,0.007,0.007,0.0,0.0,0.0,0.0


In [296]:
display_topics(nmf_model, count_vectorizer.get_feature_names_out(), 5)


Topic  0
great, staff, time, service, amazing

Topic  1
room, day, got, told, check

Topic  2
resort, staff, amazing, beautiful, time

Topic  3
beach, pool, bar, area, day

Topic  4
hotel, staff, stay, friendly, rooms

Topic  5
cana, punta, punta cana, time, stay

Topic  6
good, food, nice, really, restaurant


In [281]:
# Name one column per fitted component so the labels always match doc_topic's width.
H = pd.DataFrame(doc_topic.round(5),
             index = reviews_label,
             columns = [f"component_{i + 1}" for i in range(doc_topic.shape[1])])
H

review,component_1,component_2,component_3,component_4
The Reserve at Paradisus Punta Cana: Family concierge stay,0.22654,0.01346,0.02335,0.00771
Paradisus Punta Cana Resort: Sunrise bartender,0.01959,0.01964,0.01124,0.00740
Dreams Palm Beach Punta Cana: Overall Excellent Experience!,0.21521,0.01705,0.11019,0.05638
Now Onyx Punta Cana: Lovely 10 days,0.00000,0.01803,0.00742,0.00617
Paradisus Punta Cana Resort: Amazing spring break vacation✌️👌👨‍👩‍👧‍👦😍😍👍,0.11171,0.01040,0.01113,0.00000
Dreams Palm Beach Punta Cana: Great vacation hotel,0.18452,0.02787,0.04903,0.00000
Now Onyx Punta Cana: Paradise personified,0.02466,0.00117,0.01692,0.00000
Grand Palladium Punta Cana Resort & Spa: Excellent stay,0.04010,0.04938,0.14491,0.01256
Grand Palladium Punta Cana Resort & Spa: Family Vacation 2019,0.05017,0.00000,0.05712,0.05286
Four Points by Sheraton Puntacana Village: Pit stop,0.00000,0.00000,0.00000,0.16900
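
One way to read the document-topic table is to pick, for each review, the component with the largest weight. The sketch below does this for three rows copied from the output above; the helper name `dominant_component` is hypothetical, and treating the largest weight as the review's topic is a simplifying assumption, since a review can load on several components.

```python
components = ["component_1", "component_2", "component_3", "component_4"]

# Three rows copied from the document-topic output above.
doc_topic_rows = {
    "The Reserve at Paradisus Punta Cana: Family concierge stay": [0.22654, 0.01346, 0.02335, 0.00771],
    "Grand Palladium Punta Cana Resort & Spa: Excellent stay": [0.04010, 0.04938, 0.14491, 0.01256],
    "Four Points by Sheraton Puntacana Village: Pit stop": [0.00000, 0.00000, 0.00000, 0.16900],
}


def dominant_component(weights):
    # Position of the largest weight, mapped to its component name.
    best = max(range(len(weights)), key=lambda i: weights[i])
    return components[best]


dominant = {label: dominant_component(w) for label, w in doc_topic_rows.items()}
for label, name in dominant.items():
    print(f"{name}: {label}")
```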


## LSA

In [291]:
lsa = TruncatedSVD(3)
doc_topic = lsa.fit_transform(doc_word)
lsa.explained_variance_ratio_

array([0.02409903, 0.0087514 , 0.00709764])
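
Adding up the ratios printed above shows how much of the total variance the three LSA components capture together. This is plain arithmetic on the values from the output, not a new model fit.

```python
from itertools import accumulate

# Values copied from lsa.explained_variance_ratio_ above.
ratios = [0.02409903, 0.0087514, 0.00709764]

cumulative = list(accumulate(ratios))
total = cumulative[-1]
print(f"{total:.1%}")  # 4.0%
```

Roughly 4% of the variance is explained, which suggests that three components are a very coarse summary of a vocabulary this large.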

In [292]:
topic_word = pd.DataFrame(lsa.components_.round(3),
             index = ["component_1","component_2","component_3"],
             columns = count_vectorizer.get_feature_names_out())
topic_word

component,aa,aa batteries,aa batteries diamond,aaa,aaa diamond,aaa diamond rating,aaa simon,aaa simon exhale,aaa travel,aaa travel agent,...,zumba workouts definitely,zumba yoga,zumba yoga carlos,zumbar,zumbar discovery,zumbar discovery kid,zumbar service,zumbar service jansel,zumbar wilson,zumbar wilson jenny
component_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
component_2,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,...,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0
component_3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-0.0,-0.0,0.001,0.0,0.0,0.0,0.0,0.0,0.0


In [293]:
display_topics(lsa, count_vectorizer.get_feature_names_out(), 5)


Topic  0
resort, beach, room, great, staff

Topic  1
room, hotel, told, day, got

Topic  2
resort, room, time, day, told


## LDA

# Interpret Data