ReviewComparer

Compares reviews of Amazon products using data mining.

Usage

Clone the repository and then run

python ReviewComparer.py "product search term"

For example:

python ReviewComparer.py "laptop computer"

This will compare the top 4 results for the given search term.

or

python ReviewComparer.py -l ASIN1 ASIN2 ASIN3 ...

This will compare the products with the given ASINs. Any number of ASINs may be listed.

ASIN stands for Amazon Standard Identification Number. ASINs are 10-character codes that can be found in every Amazon product URL:

                                                                        This part
                                                                        |        |    
https://www.amazon.com/Dell-OptiPlex-Processor-Certified-Refurbished/dp/B01AWOAUJY/
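
If you would rather not copy the ASIN out of the URL by hand, the lookup can be sketched in a few lines of Python. This is a minimal illustration and not part of the repository: the helper name `extract_asin` is hypothetical, and it assumes the ASIN is the 10-character segment that follows `/dp/`, as in the URL above. Other URL layouts are not handled.

```python
import re
from typing import Optional

# Assumption: the ASIN is the 10-character alphanumeric segment after "/dp/".
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url: str) -> Optional[str]:
    """Return the ASIN embedded in a product URL, or None if none is found."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


url = "https://www.amazon.com/Dell-OptiPlex-Processor-Certified-Refurbished/dp/B01AWOAUJY/"
print(extract_asin(url))  # B01AWOAUJY
```

The extracted value can then be passed to `python ReviewComparer.py -l`.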

Some example ASINs are

B01AWOAUJY
B01CV9G1BO
B00IOTZGOE
B07BJMS28D

Example output

python ReviewComparer.py "blender"

Results for Ninja Professional Blender (BL610):
5 star(s): ['sharp', 'up_to_th', 'yes', 'margarita', 'later']
4 star(s): ['vita', 'finger', 'reli', 'battl', 'minc']
3 star(s): ['buy_it_again', 'chia', 'catch', 'but_i_wouldn_t', 'graini']
2 star(s): ['bp', 'because_i_h', 'have_to_do_that', 'worked_great', 'ave']
1 star(s): ['let_m', 'shelf', 'clip', 'blender_i_have_ev', 'blender_for_what_you_pay']
Results for Hamilton Beach (52400) Blender with 32 Oz Jar, For:
5 star(s): ['me_and', 'milkshak', 'mini', 'sleek', 'is_awesom']
4 star(s): ['bottl', 'while_it', 'while_it_', 'switch', 'shred']
3 star(s): ['with_this_item', 'however_the_blad', 'bonus', 'not_veri', 'suck']
2 star(s): ['i_m_sure_it_', 'not_very_happy_with_thi', 'blender_for_about_3_month', 'luck', 'this_was_a_great_littl']
1 star(s): ['not_recommend', 'smoke', 'about_the_only_thing_it_is_good_for', 'that_someday_i_wil', 'and_got']
Results for Nutri Ninja Pro Blender, Silver (BL456):
5 star(s): ['recip', 'book', 'eat', 'drive', 'love_it_']
4 star(s): ['the_bullet', 'overtighten', 'give_this_product', 'glitch', 'frozen_fruits_and_it_shr']
3 star(s): ['corrug', 'chainsaw', 'heck', 'encount', 'have_yet_to']
2 star(s): ['sharkninja', 'the_unit', 'and_now_i_have_to_play', 'i_mad', 'i_purchased_this_item']
1 star(s): ['dead', 'lousi', 'but_the_motor', 'condit', 'spark']
Results for BLACK+DECKER Countertop Blender with 5-Cup Glass J:
5 star(s): ['i_got_it_for_mi', 'every_day_and_it_s_stil', 'dad', 'and_i_dont', 'crank']
4 star(s): ['four_star', 'exactly_what_i_thought', 'great_but_it', 'i_like_the_best', 'great_blender_for_the_money']
3 star(s): ['not_as_good_as_a', 'of_their', 'macho', 'you_have_to_be_veri', 'while_it_do']
2 star(s): ['that_it_does_not', 'caus', 'on_the_blend', 'do_a_very_good_job', 'languag']
1 star(s): ['was_not', 'deck', 'siev', 'of_them', 'brand_and_th']

search_eval

search_eval uses the BM25 ranker from metapy to list the highest-scoring words (queries) for each review corpus. Stop-word removal and a stemming filter keep common words, numbers, and multiple words with the same root out of the results. The command-line argument word_num sets how many words are returned for each review corpus.

Usage

python search_eval.py word_num

Example 1

python search_eval.py 5

[['vitamix', 'food', 'pulver', 'lock', 'blade'], ['vita', 'heat', 'dull', 'fiber', 'sharp'], None, None, ['clip', 'broken', 'broke', 'warranti', 'explod'], ['18oz', '24oz', 'oz', '32oz', '16oz'], ['attach', 'small', 'milk', 'aspect', 'touch'], None, ['weird', 'fli', 'burnt', 'usual', 'explod'], ['ad', 'turn', 'move', 'got', 'if_it'], ['seal', 'common', 'recip', 'pro', 'nutri'], ['dirt', 'hi', 'ur', 'cool', 'fabric'], ['packag', 'corrug', 'box', 'amazon', 'planet'], ['the_unit', 'unit', 'to_se', 'abil', 'strip'], ['plan', 'rubber', 'smell', 'nutri', '2nd'], ['great_valu', 'thick', 'loosen', 'slip', 'ici'], ['consist', 'shake', 'ding', 'time_it', 'coconut'], ['macho', 'expect', 'tool', 'loud', 'cheap'], None, ['applianc', 'die', 'the_box', 'proper', 'dollar']]

Example 2

python search_eval.py 10

[['vitamix', 'food', 'pulver', 'lock', 'blade', 'click', 'pour', 'ninja', 'cleanli', 'clean'], ['vita', 'heat', 'dull', 'fiber', 'sharp', 'notic', 'veggi', 'reli', 'appl', 'comparison'], None, None, ['clip', 'broken', 'broke', 'warranti', 'explod', 'defect', 'pay', 'state', 'usag', 'lock'], ['18oz', '24oz', 'oz', '32oz', '16oz', 'shake', 'kid', 'milkshak', 'fashion', 'opportun'], ['attach', 'small', 'milk', 'aspect', 'touch', 'lot', 'option', 'switch', 'power', 'ingredi'], None, ['weird', 'fli', 'burnt', 'usual', 'explod', 'smoke', 'a_month', 'start', 'morn', 'piec'], ['ad', 'turn', 'move', 'got', 'if_it', 'back', 'the_first', 'wast', 'middl', 'mother'], ['seal', 'common', 'recip', 'pro', 'nutri', 'almond', 'review', 'this_product', 'fill', 'good'], ['dirt', 'hi', 'ur', 'cool', 'fabric', 'explain', 'mark', 'black', 'came', 'color'], ['packag', 'corrug', 'box', 'amazon', 'planet', 'shop', 'less_than_idea', 'seven', 'larger', 'able_to_mak'], ['the_unit', 'unit', 'to_se', 'abil', 'strip', 'did_not', 'told', 'was_not', 'sale', 'unfortun'], ['plan', 'rubber', 'smell', 'nutri', '2nd', 'protect', 'fool', 'manufactur', 'the_compani', 'serial'], ['great_valu', 'thick', 'loosen', 'slip', 'ici', 'strength', 'to_settl', 'whirlpool', 'the_qual', 'gear'], ['consist', 'shake', 'ding', 'time_it', 'coconut', 'delici', 'scrub', 'handl', 'hundr', 'strongest'], ['macho', 'expect', 'tool', 'loud', 'cheap', 'leak', 'basic', 'job', 'super', 'bueno'], None, ['applianc', 'die', 'the_box', 'proper', 'dollar', 'arriv', 'understat', 'dishrag', 'quit', 'cream']]

Note that Amazon does not offer an official API for retrieving reviews, so all reviews are scraped from the website. Scraping can take some time because we deliberately sleep between requests to Amazon to avoid being rate-limited. We also limit the reviews scraped for each product to the 100 that Amazon rates most useful; this limit can be changed by altering kLimitPages in ReviewComparer.py.
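
The throttling described above can be sketched roughly as follows. This is an illustration of the idea, not the code in ReviewComparer.py: `fetch_review_pages`, `fetch_page`, and the values of `K_LIMIT_PAGES` and `DELAY_SECONDS` are all hypothetical stand-ins, and the page fetcher is passed in so the sketch runs without touching the network.

```python
import time
from typing import Callable, List

K_LIMIT_PAGES = 10    # hypothetical value; the real cap is kLimitPages in ReviewComparer.py
DELAY_SECONDS = 2.0   # hypothetical pause between requests


def fetch_review_pages(
    fetch_page: Callable[[int], List[str]],
    limit_pages: int = K_LIMIT_PAGES,
    delay_seconds: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect reviews page by page, pausing between consecutive requests."""
    reviews: List[str] = []
    for page in range(1, limit_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:       # an empty page means nothing is left to fetch
            break
        reviews.extend(page_reviews)
        if page < limit_pages:
            sleep(delay_seconds)   # throttle before the next request
    return reviews


def stub_fetch(page: int) -> List[str]:
    """Stand-in for a real scraper: two pages of fake reviews, then nothing."""
    return [f"review {page}.1", f"review {page}.2"] if page <= 2 else []


# delay_seconds=0 keeps the demo instant; a real scraper would keep the pause.
print(len(fetch_review_pages(stub_fetch, delay_seconds=0)))  # 4
```

Raising the page cap collects more reviews per product at the cost of proportionally more waiting.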

This output shows the kinds of words that best differentiate each (product, rating) pair from the other (product, rating) pairs. Results are sorted by BM25 score, where each word is a query and the document is all of the reviews for that (product, rating) pair.
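
To make the ranking concrete, here is a small from-scratch sketch of scoring single-word queries with BM25. It is not the metapy ranker the project uses: the function name `top_terms_bm25`, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and the input is assumed to be already tokenized, stemmed, and stop-word filtered.

```python
import math
from collections import Counter
from typing import Dict, List


def top_terms_bm25(
    docs: Dict[str, List[str]],
    top_n: int = 5,
    k1: float = 1.2,
    b: float = 0.75,
) -> Dict[str, List[str]]:
    """For each document, rank its own terms by their BM25 score as one-word queries."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = (sum(len(tokens) for tokens in docs.values()) / n_docs) or 1.0

    # Document frequency: in how many documents each term occurs.
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    result: Dict[str, List[str]] = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        scores = {}
        for term, freq in term_freq.items():
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed, non-negative IDF
            scores[term] = idf * freq * (k1 + 1) / (freq + length_norm)
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:top_n]
    return result


docs = {
    "blender A, 5 stars": ["sharp", "blade", "great", "sharp"],
    "blender A, 1 star": ["smoke", "motor", "great", "smoke"],
}
print(top_terms_bm25(docs, top_n=2))
```

In this toy corpus "great" appears in both documents, so its IDF is low and it ranks below the words that occur in only one of them. That is the sense in which the top words differentiate one (product, rating) pair from the others.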

The phrases used by phrase_segmentation are generated with AutoPhrase (https://github.com/shangjingbo1226/AutoPhrase). Their code is applied to a corpus of all the reviews that we have collected.

People who worked on this project

cshang4, jianing2, siqig2 (@illinois.edu)

Amazon scraping and phrase mining was done by cshang4.

Implementing BM25 and interpreting the results was done by jianing2 and siqig2.
