# A Quick Ratings Predictor Web-App
#### https://review-rating1.azurewebsites.net/

- In this notebook we explore a dataset of board-game reviews and their ratings.
- From it we build a predictor that estimates a rating from the text of a review.


## Dataset
- The file needed to build this system is bgg-13m-reviews.csv, read from ./boardgamegeek-reviews/
- Total number of reviews: 13170073
- Total trainable reviews (rows that have a comment alongside the rating): 2637756

## Methodology
- Naive Bayes is a fast and widely used baseline for text classification, which is why we built this system on the sklearn.naive_bayes.MultinomialNB classifier.
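The overall approach can be sketched on a few made-up reviews. This is only an illustration: the reviews, the ratings, and the `toy_*` names are hypothetical and are not taken from the dataset, and chaining the steps with `make_pipeline` is a convenience chosen here rather than something the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and ratings, not taken from the dataset
toy_reviews = [
    "great game, love it",
    "love the artwork, great fun",
    "boring and far too long",
    "boring rules, bad components",
]
toy_ratings = ["8", "8", "2", "2"]

# Counts -> TF-IDF -> multinomial naive Bayes, chained in one estimator
toy_model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_model.fit(toy_reviews, toy_ratings)

toy_prediction = toy_model.predict(["great fun, love it"])
print(toy_prediction)
```

Every word of the new review appears only in the reviews rated "8", so the model picks that class.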

## Cleaning Data
- The functions cleanString and removeStopWords below clean a review by removing unwanted characters and stopwords.

## Train Test Split
- We used roughly 90% of the data to train the model and the remaining 10% to test it. The split is drawn with a random mask, so the proportions are approximate.
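A random-mask split of this kind might look like the sketch below. The table is a hypothetical stand-in for `review_rating_table`, and the fixed seed is an assumption added for reproducibility; the notebook itself does not set one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for review_rating_table
toy_table = pd.DataFrame({
    "comment": [f"review {i}" for i in range(1000)],
    "rating": np.arange(1000) % 11,
})

# The seed is an assumption; the notebook draws the mask without one
rng = np.random.default_rng(seed=0)
train_mask = rng.random(len(toy_table)) <= 0.9

# Rows where the mask is True go to training, the rest go to testing
toy_train = toy_table[train_mask]
toy_test = toy_table[~train_mask]
print(len(toy_train), len(toy_test))
```

Because each row is assigned independently, the two parts never overlap, but their sizes only approximate 90% and 10%.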

## NLP Techniques used

### 1. Count Vectorizer
- We used CountVectorizer, which we realized already performs much of the data cleaning (such as lowercasing and stripping punctuation), so our own cleaning functions were not strictly necessary.
- It builds a sparse matrix from extremely high-dimensional data with ease and without consuming too much RAM.
- We picked 10000 features to build the multinomial model with.
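The built-in cleaning and the feature cap can be seen on two made-up sentences. The documents and variable names below are hypothetical; the sketch only relies on CountVectorizer's default settings and its `max_features` parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents with mixed case and punctuation
toy_docs = ["GREAT game!!! Great fun.", "Not a great game... (too long)"]

# Default settings lowercase the text and keep only tokens of two or more word characters
default_vect = CountVectorizer()
default_counts = default_vect.fit_transform(toy_docs)
default_vocab = sorted(default_vect.vocabulary_)
print(default_vocab)

# max_features keeps only the terms with the highest total count in the corpus
small_vect = CountVectorizer(max_features=2)
small_vect.fit(toy_docs)
small_vocab = sorted(small_vect.vocabulary_)
print(small_vocab)
```

The single-character token "a" and all punctuation disappear without any custom preprocessing, and with `max_features=2` only "great" and "game" survive.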

### 2. TFIDF
- We used sklearn.feature_extraction.text.TfidfTransformer to convert the CountVectorizer counts to TF-IDF values, which down-weight words that appear in many reviews and are therefore more informative for the model.
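To make the weighting concrete, the sketch below recomputes the IDF by hand and compares it with what TfidfTransformer learns. It assumes the transformer's default settings (smoothed IDF, L2 row normalisation) and uses hypothetical toy documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents
toy_docs = ["great game", "great fun", "boring game", "great great art"]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_tfidf_transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm="l2"
toy_tfidf = toy_tfidf_transformer.fit_transform(toy_counts)

# Recompute the smoothed IDF by hand: ln((1 + n) / (1 + df)) + 1
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts.toarray() > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(manual_idf, toy_tfidf_transformer.idf_)
print(idf_matches)

# Each row is L2-normalised, so every document vector has length 1
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

A word such as "great", which occurs in most documents, gets a lower IDF than a rare word such as "art".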

## Issues we faced

### 1. Working with High Dimensional Data
- One of the biggest issues was that our first attempt to select the best 10000 features ourselves was extremely slow. CountVectorizer did the same job with ease.
- The tutorial at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html was very useful for building the model effectively.

### 2. Azure
Setting up Azure had many problems of its own. It wouldn't run the main Python file. We traced the problem to the name of that file: it had to be something in the format of "applications.py".

## Calculations

- For any input, sklearn's MultinomialNB returns the posterior probability of each rating class through clf.predict_proba(tester_tfidf). The class priors learned from the training data are available separately in clf.class_log_prior_.
- We often noticed that only about 20% of the words in a review are used in making the decision.
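The columns returned by `predict_proba` follow the order of `classes_`, which is worth checking when the labels are strings. The sketch below uses hypothetical toy reviews and `toy_*` names to pair each probability with its class.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews with string rating labels
toy_reviews = ["great game", "great fun", "boring game", "bad boring rules"]
toy_ratings = ["10", "8", "2", "2"]

toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_reviews))
toy_clf = MultinomialNB().fit(X_toy, toy_ratings)

# String labels are sorted as text, so "10" comes before "2"
toy_classes = list(toy_clf.classes_)
print(toy_classes)

# Transform the new review with the already fitted vectorizer and transformer
x_new = toy_tfidf.transform(toy_vect.transform(["great fun game"]))
posterior = toy_clf.predict_proba(x_new)[0]

# Pair each probability with the class it belongs to
posterior_by_class = dict(zip(toy_classes, posterior))
print(posterior_by_class)
```

Reading the probabilities through such a mapping avoids assuming that the columns are in numeric rating order.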



In [77]:


import numpy as np
import pandas as pd
import math
import os
from sklearn.metrics import mean_squared_error as mse
from sklearn.naive_bayes import MultinomialNB
from copy import deepcopy
import pickle

In [78]:
# Preprocess
# Characters that are removed outright; "&" is handled separately and becomes "and"
_PUNCT_TO_DROP = ".,!@#$%^*()+=?'\"{}[]<>~`:;|\\/"
_CLEAN_TABLE = str.maketrans({**{ch: "" for ch in _PUNCT_TO_DROP}, "&": "and"})

def cleanString(incomingString):
    # Lowercase the text, drop the punctuation above, and replace "&" with "and"
    return incomingString.lower().translate(_CLEAN_TABLE)

def removeStopWords(incoming):
    stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
    stopword_set = set(stopwords)
    # Drop every occurrence of each stopword (list.remove would only drop the first one)
    return [word for word in incoming.split(" ") if word not in stopword_set]




In [79]:

# Input data files are available in the "./boardgamegeek-reviews/" directory.

# ip = []
# for dirname, _, filenames in os.walk('./boardgamegeek-reviews'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))
#         ip.append(pd.read_csv(os.path.join(dirname, filename)))
# Any results you write to the current directory are saved as output.
game_review = pd.read_csv('./boardgamegeek-reviews/bgg-13m-reviews.csv')

In [80]:
# game_review = ip[2]
# game_detailed_info = ip[0]
# game_detail = ip[1]

In [81]:
review_rating_table = pd.DataFrame({'comment': game_review['comment'], 'rating': game_review['rating']})
print(len(review_rating_table))
review_rating_table = review_rating_table[pd.notna(review_rating_table['comment'])]
review_rating_table = review_rating_table.reset_index(drop=True)
print(len(review_rating_table))

13170073
2637756


In [82]:
review_rating_table.loc[2637755,'rating']

2.0

In [83]:
# Convert ratings to integers (astype truncates toward zero; it does not round)
# review_rating_table = review_rating_table.round([1])
review_rating_table['rating'] = review_rating_table['rating'].astype('int32')

In [84]:
# Reduce classes: map each integer rating to an even class label (1->2, 3->4, ..., 9->10)
nearest_int = [0,2,2,4,4,6,6,8,8,10,10]
for pos, val in enumerate(nearest_int):
    review_rating_table.loc[review_rating_table['rating']==pos, 'rating']=val
# for pos,val in enumerate(review_rating_table['rating']):
#     if pos%10==0:
#         print('\r', 'counter: ', pos, end='')
#     review_rating_table.loc[pos, 'rating'] = nearest_int[val]

In [85]:
review_rating_table

|  | comment | rating |
|---|---|---|
| 0 | Currently, this sits on my list as my favorite... | 10 |
| 1 | I know it says how many plays, but many, many ... | 10 |
| 2 | i will never tire of this game.. Awesome | 10 |
| 3 | This is probably the best game I ever played. ... | 10 |
| 4 | Fantastic game. Got me hooked on games all ove... | 10 |
| ... | ... | ... |
| 2637751 | Horrible party game. I'm dumping this one! | 4 |
| 2637752 | Difficult to build anything at all with the in... | 4 |
| 2637753 | Lego created a version of Pictionary, only you... | 4 |
| 2637754 | This game is very similar to Creationary. It c... | 2 |


In [86]:
# https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas
# Split train and test using pandas
msk_train = np.random.rand(len(review_rating_table)) <= 0.9
# msk_test = np.random.rand(len(review_rating_table[~msk_train])) <= 0.1
review_rating_table_train = review_rating_table[msk_train]
review_rating_table_test = review_rating_table[~msk_train]

In [87]:
ytrain = review_rating_table_train['rating']

In [88]:
# Tokenizing text
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(min_df = 1, max_features = 10000)
X_train_counts = count_vect.fit_transform(review_rating_table_train['comment'])
X_train_counts.shape

(2374382, 10000)

In [89]:
# Save vocabulary for web application - count_vect.vocabulary_
with open('vocabulary.p', 'wb') as fp:
    pickle.dump(count_vect.vocabulary_, fp, protocol=pickle.HIGHEST_PROTOCOL)

In [90]:
# Testing load vocabulary
with open('vocabulary.p', 'rb') as fp:
    data = pickle.load(fp)
# print(data)

In [91]:
# Tfidf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2374382, 10000)

In [92]:
# Build model with tfidf data of each word in the document wrt all documents, and y value of corresponding doc
ytrain = np.around(ytrain).astype('U')
clf = MultinomialNB().fit(X_train_tfidf, ytrain)

In [93]:
clf.class_log_prior_

array([-12.37766266,  -1.94021125,  -3.75901457,  -2.53330428,
        -1.238304  ,  -0.76842297])
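These log priors are easier to read as shares of the training set. The sketch below copies the values from the output above and assumes the class order is the text-sorted order of the six string labels, since the labels were cast to strings before fitting. The "always predict the most common class" figure is only a rough reference point, and it is computed from the training priors rather than from the test set.

```python
import numpy as np

# Values copied from the clf.class_log_prior_ output
class_log_prior = np.array([-12.37766266, -1.94021125, -3.75901457,
                            -2.53330428, -1.238304, -0.76842297])

# Assumption: the labels are strings, so scikit-learn orders them as text
class_labels = sorted(["0", "2", "4", "6", "8", "10"])

# Exponentiate the log priors to get each class's share of the training data
class_prior = np.exp(class_log_prior)
for label, prior in zip(class_labels, class_prior):
    print(f"rating {label:>2}: {prior:.4f}")

# Share of the most common class, i.e. the accuracy of always predicting it
majority_label = class_labels[int(np.argmax(class_prior))]
majority_share = float(class_prior.max())
print(majority_label, round(majority_share, 4))
```

Under that ordering the most common class is "8", with a little under half of the training reviews.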

## Test Data
- We run the model on the 10% held-out test data to measure its accuracy

In [94]:
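# Vectorize the test comments with the training vocabulary, apply TF-IDF, and build the test labels
count_vect_test = CountVectorizer(vocabulary=count_vect.vocabulary_)
X_test_counts = count_vect_test.fit_transform(review_rating_table_test['comment'])
# Note: this fits a new TfidfTransformer on the test counts, so the IDF weights come from
# the test set and the transformer fitted on the training data is overwritten.
# Calling transform() on the training-fitted transformer is the usual practice and may change the scores below.
tfidf_transformer = TfidfTransformer()
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
ytest = review_rating_table_test['rating']
ytest = np.around(ytest).astype('U')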
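The difference between reusing the training IDF weights and refitting them on new text can be sketched on toy data. The documents and `toy_*` names below are hypothetical, and the sketch illustrates the usual scikit-learn practice rather than what the cell above does.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical data: "great" is common in training, "game" is rare
toy_train = ["great game great fun", "great art", "great rules", "long boring rules"]
toy_test = ["great game"]

toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_tfidf.fit(toy_vect.fit_transform(toy_train))

test_counts = toy_vect.transform(toy_test)

# Reuse the IDF weights learned on the training data
reused = toy_tfidf.transform(test_counts).toarray()

# Refit on the test counts alone, which discards the training IDF weights
refit = TfidfTransformer().fit_transform(test_counts).toarray()

print(reused.round(3))
print(refit.round(3))
```

With the reused transformer the rare word "game" outweighs the common word "great". After refitting on a single document both words get the same weight, so the model sees features on a different scale from the one it was trained on.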

In [95]:
prediction = clf.predict(X_test_tfidf)

In [96]:
sum(prediction==ytest)/len(ytest)

0.5233356367750803

In [97]:
# Convert the string labels back to floats so the RMSE can be computed
prediction = prediction.astype(float)
ytest = ytest.astype(float)
# Open question: what is the control (baseline) to compare against?



In [98]:
# Root mean squared error (RMSE)
np.sqrt(mse(ytest, prediction))

1.7058651552124853
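Since the class labels are two points apart, an RMSE of about 1.71 is slightly less than one class step. How the two metrics relate can be checked on a handful of hypothetical labels, written in the same string form as `prediction` and `ytest`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical string labels, in the same form as `prediction` and `ytest`
toy_true = np.array(["8", "6", "10", "2"])
toy_pred = np.array(["8", "8", "8", "4"])

# Accuracy only counts exact matches
toy_accuracy = float(np.mean(toy_pred == toy_true))

# RMSE needs numbers, so convert the labels first; it also reflects how far off a miss is
toy_rmse = float(np.sqrt(mean_squared_error(toy_true.astype(float), toy_pred.astype(float))))
print(toy_accuracy, toy_rmse)
```

Here three of the four predictions are wrong, yet each miss is only one class step away, so the RMSE stays below 2.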

In [99]:
import pickle

In [100]:
filename = 'nb_model_final.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as fp:
    pickle.dump(clf, fp)

In [101]:
# load the model from disk
with open(filename, 'rb') as fp:
    loaded_model = pickle.load(fp)
result = loaded_model.predict(X_test_tfidf)
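The pickled classifier on its own does not carry the vocabulary or the IDF weights. One possible alternative, shown here as a sketch on hypothetical toy data rather than as what the web app does, is to pickle a single pipeline that bundles all three steps.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data
toy_reviews = ["great game", "great fun", "boring game", "bad boring rules"]
toy_ratings = ["8", "8", "2", "2"]

toy_pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_pipeline.fit(toy_reviews, toy_ratings)

# Serialise to bytes here; pickle.dump with a file opened in "wb" mode works the same way
payload = pickle.dumps(toy_pipeline, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# The restored pipeline accepts raw text and gives the same answers
same_predictions = list(restored.predict(toy_reviews)) == list(toy_pipeline.predict(toy_reviews))
print(same_predictions)
```

With this layout the serving code only has to load one object and call `predict` on raw review text.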

### Predicting a single review and obtaining the probability distribution over the rating classes

In [102]:

newReview = "Amazing crazy good brilliant excellent mad"
count_vect_test = CountVectorizer(vocabulary=count_vect.vocabulary_)
tester_counts = count_vect_test.fit_transform([newReview])
# Note: a TfidfTransformer fitted on a single review gives every word in that review the same IDF,
# so the IDF weights learned during training are not applied here.
tfidf_transformer = TfidfTransformer()
tester_tfidf = tfidf_transformer.fit_transform(tester_counts)
prediction = clf.predict(tester_tfidf)
print("Prediction:", prediction)
# Class probabilities as percentages, in the order of clf.classes_
prob = clf.predict_proba(tester_tfidf) * 100
print("Probability: ", prob)

Prediction: ['8']
Probability:  [[1.09057192e-05 4.19645887e+01 1.82915117e-01 9.24874274e-01
  7.60506813e+00 4.93225429e+01]]


In [103]:
words = newReview.split(' ')
count_vector_vocabulary = count_vect.vocabulary_.keys()
# Keep only the words that exist in the fitted vocabulary
required_words = [word for word in words if word.lower() in count_vector_vocabulary]
print('for the respective words: ', required_words)

for the respective words:  ['Amazing', 'crazy', 'good', 'brilliant', 'excellent', 'mad']
