Sentiment Analysis and Aspect classification for Hotel Reviews

This is the source code for MonkeyLearn's series of posts on analyzing sentiment and aspects in hotel reviews using machine learning models. The code runs on Python 2.7.

(May 2018 update: TripAdvisor and Booking.com have changed their sites substantially since these spiders were written, so the spiders no longer work. The blog posts and code remain useful as an example of how to build a Scrapy spider, but the examples themselves are no longer functional. We will probably fix the spiders in the future, since updating the selectors is likely enough to get everything working again.)

Code organization

The project is a Scrapy project used to gather training and testing data from sites such as TripAdvisor and Booking. It also includes a set of Python scripts and Jupyter notebooks that cover the remaining processing and analysis steps.

Creating a sentiment analysis model with Scrapy and MonkeyLearn

The TripAdvisor spider (hotel_sentiment/spider/tripadvisor_spider.py) is used to gather data to train a sentiment analysis classifier in MonkeyLearn. Review texts are used as the sample content and review star ratings are used as the category (1 and 2 stars = Negative, 4 and 5 stars = Positive).
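The star-to-category rule above can be sketched in a few lines of Python. This is only an illustration, not the project's own conversion code: the function names are hypothetical, and skipping 3-star reviews is an assumption, since only 1, 2, 4 and 5 stars are given a category.

```python
def stars_to_category(stars):
    """Return 'Negative', 'Positive', or None for a review's star rating."""
    stars = int(stars)
    if stars in (1, 2):
        return "Negative"
    if stars in (4, 5):
        return "Positive"
    # Assumption: 3-star (or unexpected) ratings get no category and are skipped.
    return None


def build_training_samples(rows):
    """Turn (text, stars) pairs into (text, category) pairs.

    Rows without a category or with empty text are dropped.
    """
    samples = []
    for text, stars in rows:
        category = stars_to_category(stars)
        if category is not None and text.strip():
            samples.append((text.strip(), category))
    return samples
```

Calling build_training_samples on a list of (review text, stars) pairs taken from the crawled data would give samples ready to be uploaded as training data.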

To crawl ~15000 items from TripAdvisor, use:

scrapy crawl tripadvisor -o itemsTripadvisor.csv -s CLOSESPIDER_ITEMCOUNT=15000

You can check out the generated machine learning sentiment analysis model here.

Aspect Analysis from reviews using Machine Learning

The Booking spider (hotel_sentiment/spider/booking_spider.py) is used to gather data to train an aspect classifier in MonkeyLearn. The data obtained with this spider can be manually tagged with each aspect (e.g. cleanliness, comfort & facilities, food, internet, location, staff, value for money) using MonkeyLearn's Sample tab or an external crowdsourcing service like Mechanical Turk.

To crawl from Booking, use:

scrapy crawl booking -o itemsBooking.csv

You first have to add the URL of a starting city. To crawl a single hotel on Booking, use:

scrapy crawl booking_singlehotel -o <hotel name>.csv

opinionTokenizer.py is a simple script that extracts the "opinion units" from each review.
classify_and_plot_reviews.ipynb is a notebook that uses the generated model to classify new reviews and then plots the results using Plotly.
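To give a rough idea of what splitting a review into opinion units can look like, here is a naive sketch. It is not the logic of opinionTokenizer.py. It assumes an opinion unit is a clause ending at sentence punctuation, a semicolon, or the conjunction "but", and the function name is hypothetical.

```python
import re

# Assumed boundaries between opinion units: sentence-ending punctuation,
# semicolons, and the contrastive conjunction "but" (with an optional comma).
_BOUNDARY = re.compile(r"[.!?;]+|,?\s+\bbut\b\s+", re.IGNORECASE)


def split_opinion_units(review):
    """Split a review into short clauses, dropping empty fragments."""
    parts = _BOUNDARY.split(review)
    return [part.strip() for part in parts if part.strip()]
```

Each resulting unit can then be classified on its own, so a review that praises the room but criticizes the wifi yields two separately tagged pieces. A regex like this will also split on decimal points and abbreviations, so a real tokenizer needs more care.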

You can check out the generated machine learning aspect classifier here.

Machine Learning over 1M hotel reviews finds interesting insights

To crawl from TripAdvisor, use:

scrapy crawl tripadvisor_more -a start_url="http://some_url" -o <hotel_name>.csv -s CLOSESPIDER_ITEMCOUNT=20000

Replace http://some_url with the URL of a starting city to crawl from, such as https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html.

The scripts and notebooks necessary to replicate the post are in the classify_elastic folder:

classify_elastic/generate_files_for_indexing.py will take the CSV file produced by Scrapy and generate two files that the other scripts use.
classify_elastic/classify_pipe.py will open the opinion_units file, classify it with MonkeyLearn by topic and sentiment, and save the results to a new CSV file.
classify_elastic/index_definition.json contains the mapping definitions used in Elasticsearch.
classify_elastic/index_reviews.py will index the reviews generated by generate_files_for_indexing.py into your Elasticsearch instance.
classify_elastic/index_opinion_units.py will index the classified opinion units into your Elasticsearch instance.
classify_elastic/Extract keywords.ipynb shows how to extract keywords from the indexed data.
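As a rough illustration of the indexing step, the sketch below turns CSV rows into a request body in the newline-delimited JSON format that the Elasticsearch bulk API accepts. Whether the indexing scripts above use the bulk API is not stated here, so treat this as one possible approach. The function name, index name, and column names are hypothetical.

```python
import csv
import json


def csv_to_bulk_body(csv_lines, index_name):
    """Build a bulk request body from an iterable of CSV lines.

    Each row becomes two lines: an "index" action line naming the target
    index, followed by the row itself as a JSON document.
    """
    lines = []
    for row in csv.DictReader(csv_lines):
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(row, sort_keys=True))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

An open CSV file can be passed directly as csv_lines. The metadata fields required in the action line depend on the Elasticsearch version, so check the documentation for the version you run.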

Finally, the queries folder contains some queries that were used to power the Kibana visualization.

