This program scrapes the text of a Sephora review page and returns, for each review, the review ID, rating score, first name and location of the reviewer, the review text, and the number of people who found the review helpful.
Use the following pip commands in command prompt to install the necessary libraries.

    pip3 install requests
    pip3 install lxml
    pip3 install beautifulsoup4
    pip3 install nltk

Items are listed in order of importance.
- Create a map that shows the review frequencies and ratings by state
- Filter out the word "people" in array_helpful (maybe try to use replace())
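
A minimal sketch of the replace() idea for array_helpful follows. It assumes each entry is a string such as "12 people", which is a guess about the scraped format, and `helpful_count` is a hypothetical helper name that does not come from the program itself.

```python
def helpful_count(text):
    """Strip the word "people" from an entry and return the count as an int.

    Assumes entries look like "12 people"; anything that does not reduce
    to plain digits is treated as 0.
    """
    cleaned = text.replace("people", "").strip()
    return int(cleaned) if cleaned.isdigit() else 0


# Sample data in the assumed format, not real scraped output.
array_helpful = ["12 people", "3 people", "0 people"]
counts = [helpful_count(entry) for entry in array_helpful]
```

If the real entries carry extra text (parentheses, "found this helpful", and so on), a regular expression that pulls out the first run of digits may be a better fit than replace().
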
Recent fixes are listed first.
- Fixed the "index out of range" error. The first (oldest) review did not have a rating, so that one entry is omitted from the data set.
- Stop skipping over hidden paragraphs (when reviews are long and you have to click "see more", the crawler skips over these parts)
- Fix 5, 4, 3, 2, 1 bug at the beginning of rating (for some odd reason the first 5 ratings on every page come in as 5, 4, 3, 2, 1, but that is incorrect)
- Tokenize words using nltk
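
One way the 5, 4, 3, 2, 1 fix could look is sketched below. It assumes the ratings for a page are collected as a list of integers; `drop_spurious_ratings` and `SPURIOUS_PREFIX` are hypothetical names, and this is not necessarily how the program handles it.

```python
# The bogus values that show up ahead of the real ratings on each page.
SPURIOUS_PREFIX = [5, 4, 3, 2, 1]


def drop_spurious_ratings(ratings):
    """Return a copy of ratings without a leading 5, 4, 3, 2, 1 run."""
    if ratings[:len(SPURIOUS_PREFIX)] == SPURIOUS_PREFIX:
        return ratings[len(SPURIOUS_PREFIX):]
    return list(ratings)


# Sample data, not real scraped output.
page_ratings = [5, 4, 3, 2, 1, 4, 5, 2]
cleaned = drop_spurious_ratings(page_ratings)
```

Note that a page whose first five genuine reviews happen to be rated 5, 4, 3, 2, 1 in that order would also be trimmed, so narrowing the selector that collects the ratings may be the more robust fix.
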