For this project, I decided to predict how many stars a customer is most likely to give, based on the text of their written review.
I pulled a data set of 2 million Kindle reviews, but due to computational limitations I was forced to work with only a random 10% sample of it.
Even though this project was challenging for me, it pushed me into topics I had not been taught, such as Natural Language Processing (NLP). That said, the work focuses more on statistical tools than on NLP tools.
As a stretch goal, I deployed an app on Heroku where you can try out how this works. You can see it here.
Some highlights of the data analysis can be read in my blog post.
Here are the final notebooks and some pickles I had to make due to the size of the data set.
- gzip
- json
- pandas as pd
- numpy as np
- matplotlib.pyplot as plt
- urllib.request.urlopen
- string
- seaborn
- pickle
- Heroku
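
As a rough illustration of how a random 10% sample could be drawn with the libraries listed above, here is a minimal sketch. It assumes the reviews are stored as gzip-compressed, line-delimited JSON (one review per line); the function name `sample_reviews`, its parameters, and the file name in the usage note are hypothetical, and the notebooks may do this differently.

```python
import gzip
import json
import random


def sample_reviews(path, fraction=0.10, seed=42):
    """Stream a gzip-compressed, line-delimited JSON file and keep a random fraction of its records.

    Assumption: each non-empty line of the file is one JSON object (one review).
    """
    rng = random.Random(seed)  # seeded so the same sample can be reproduced
    sample = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Keep each review independently with probability `fraction`,
            # so the full file never has to fit in memory.
            if rng.random() < fraction:
                sample.append(json.loads(line))
    return sample
```

Something like `pd.DataFrame(sample_reviews("kindle_reviews.json.gz"))` would then give a DataFrame of roughly a tenth of the reviews. Because each line is kept independently, the sample size is approximately, not exactly, 10% of the file.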
This is the very first version. At some point I'd like to use more powerful NLP tools and compare the new results against these.