<h3>REVIEWS RATING ESTIMATOR</h3>

<p>This project estimates ratings from the text of customer reviews. Its purpose is to supply ratings for reviews where the user's rating is missing. The estimated ratings can then be used to analyze how well your product is being received.</p><br>
<p>The dataset used for this project is the BoardGameGeek reviews dataset, available on Kaggle here: 
    <a href="https://www.kaggle.com/jvanelteren/boardgamegeek-reviews">Dataset</a>. The same approach should also work for other kinds of review datasets, e.g. IMDB movie reviews or restaurant reviews. The steps involved in creating the model are described in detail throughout this notebook.</p>

<h3> Reading the data </h3>
<p>The first step is to read the data into a pandas DataFrame.</p>

In [None]:
import pandas as pd

df = pd.read_csv('D:/UTA/Fall-2020/DM/TermProject/archive/bgg-15m-reviews.csv')
del df['Unnamed: 0']


<h4>Dropping rows that have missing comments (NaN)</h4>

<p>This step is essential because the model can only be trained on rows that contain review text. Rows that have a rating but no review are therefore removed.</p>

In [None]:
#Dropping rows with missing reviews (only the 'comment' column is checked for NaN)
temp_dataset = df.dropna(subset=['comment']).reset_index(drop=True)
temp_dataset


<p>Keep only the 'comment' and 'rating' columns and rename them to 'Reviews' and 'Rating'.</p>

In [None]:
dataset = temp_dataset[['comment', 'rating']].copy()
dataset.columns = ['Reviews','Rating']
dataset

<h3> PREPROCESSING </h3>

<p>The Reviews column is cleaned with NLTK: punctuation, stopwords, and numbers are removed, and the remaining tokens are lemmatized.</p>

In [None]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# The NLTK tokenizer, stopword and WordNet resources must be installed (see nltk.download)

# Build these once instead of on every call, since the helpers run once per review
tokeniser = RegexpTokenizer(r'\w+')
lemmatiser = WordNetLemmatizer()
english_stopwords = set(stopwords.words('english'))

def sentence_tokenize(text):
    # split a review into sentences
    return nltk.sent_tokenize(text)

def remove_html_tags(text):
    # strip anything that looks like <tag>
    return re.sub(r'<.*?>', '', text)

def remove_numbers(text):
    # drop every token that contains a digit
    return re.sub(r'\w*\d\w*', '', text)

def word_tokenize(text):
    # keep runs of word characters only, which also removes punctuation
    return tokeniser.tokenize(text)

def lemmatization(tokens):
    # every token is looked up as a verb (pos='v')
    return [lemmatiser.lemmatize(token, pos='v') for token in tokens]

def remove_stopwords(tokens):
    # set membership is much faster than scanning a list for each word
    return [word for word in tokens if word not in english_stopwords]

def list_to_string(tokens):
    # join tokens back into one space-separated string
    return " ".join(tokens)

    

<h4>In this step the reviews are lowercased, and HTML tags and tokens containing numbers are removed</h4>

In [None]:
lower_case_dataset = pd.DataFrame(dataset.Reviews.apply(lambda x: x.lower()))
reviews_without_htmltags_df =  pd.DataFrame(lower_case_dataset.Reviews.apply(lambda x: remove_html_tags(x)))
reviews_without_htmltags_df =  pd.DataFrame(reviews_without_htmltags_df.Reviews.apply(lambda x: remove_numbers(x)))
reviews_without_htmltags_df
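
<p>To make the effect of these cleaning steps concrete, here is a minimal standalone sketch that applies the same two regular expressions to a single made-up review. The function name clean_review and the sample text are illustrative assumptions, not part of the notebook.</p>

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')               # non-greedy, so each tag is matched separately
NUMERIC_TOKEN_PATTERN = re.compile(r'\w*\d\w*')  # any run of word characters containing a digit

def clean_review(text):
    """Lowercase, strip HTML-like tags, then drop tokens that contain digits."""
    text = text.lower()
    text = TAG_PATTERN.sub('', text)
    text = NUMERIC_TOKEN_PATTERN.sub('', text)
    return text

sample = "Great <b>2-player</b> game, played it 10 times in 2020!"
cleaned = clean_review(sample)
print(cleaned)  # great -player game, played it  times in !
```

<p>Note that the number pattern removes "2" from "2-player" but leaves "-player" behind, and that leftover punctuation and double spaces survive this step; they are only dropped later by word tokenization.</p>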

<h4>Sentence tokenization, word tokenization, and stopword removal</h4>

In [None]:
#sentence tokenizing
reviews_without_htmltags_df['Reviews_sentence_tokenized'] = reviews_without_htmltags_df['Reviews'].apply(sentence_tokenize)

#word tokenizing
reviews_without_htmltags_df['Reviews_word_tokenized'] = reviews_without_htmltags_df['Reviews'].apply(word_tokenize)

#removing stop words
reviews_without_htmltags_df['Reviews_without_stopwords'] = reviews_without_htmltags_df['Reviews_word_tokenized'].apply(lambda x: remove_stopwords(x))
reviews_without_htmltags_df

<h4>Lemmatization</h4>
<p>Lemmatization uses a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma. Given the token "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw" depending on whether the token was used as a verb or a noun. The two also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly collapses only the different inflectional forms of a lemma. The lemmatization helper defined above treats every token as a verb (pos='v').</p>

In [None]:
#performing lemmatization as a preprocessing step
reviews_lemmatized = pd.DataFrame(reviews_without_htmltags_df['Reviews_without_stopwords'].apply(lambda x: lemmatization(x)))

In [None]:
reviews_lemmatized.columns = ['Reviews']
reviews_lemmatized

In [None]:
dataset['Reviews'] = reviews_lemmatized['Reviews'].apply(lambda x: list_to_string(x))
#dataset['Reviews'] = reviews_lemmatized

<h4>Processed Dataset</h4>
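<p>The ratings are converted to integers so that each review falls into one class on the 1-10 scale. Note that astype(np.int64) truncates the fractional part rather than rounding to the nearest integer, so a rating of 7.9 becomes 7. This cleaned and pre-processed dataset is then used for training and testing our algorithm.</p>

In [None]:
import numpy as np
#Convert the ratings to integers (astype truncates; call .round() first if true rounding is wanted)
dataset['Rating'] = dataset['Rating'].astype(np.int64)
#final dataset with pre-processed reviews
dataset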
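
<p>The cell above converts ratings with astype(np.int64), which for positive values drops the fractional part in the same way as Python's int(). The sketch below uses plain Python on a few made-up ratings to show how truncation differs from two common rounding rules; the values are illustrative only.</p>

```python
import math

ratings = [7.9, 7.5, 6.5, 1.4]  # made-up example ratings

truncated = [int(r) for r in ratings]                     # drops the fractional part
rounded_half_even = [round(r) for r in ratings]           # Python's round(): ties go to the even integer
rounded_half_up = [math.floor(r + 0.5) for r in ratings]  # ties always go up

print(truncated)          # [7, 7, 6, 1]
print(rounded_half_even)  # [8, 8, 6, 1]
print(rounded_half_up)    # [8, 8, 7, 1]
```

<p>Which rule is used changes the class label of every rating with a fractional part of 0.5 or more, so it should be chosen deliberately and kept the same between training and deployment.</p>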

<h3> Splitting dataset into train and test </h3>
<p>We split the dataset in a 4:1 ratio: the first 80% of the rows form the training set and the remaining 20% form the test set.</p>

In [None]:
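# Slice by position: first 80% of rows for training, the rest for testing.
# .copy() gives independent frames, so adding columns later does not trigger pandas warnings.
split_index = int(.8*len(dataset))
train_df = dataset.iloc[:split_index].copy()
test_df = dataset.iloc[split_index:].copy()
print("training: ",train_df.shape)
print("test: ",test_df.shape)
Y_train = train_df['Rating'].tolist() #ratings for the train dataset
Y_test = test_df['Rating'].tolist() # ratings for the test dataset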
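
<p>The split above keeps the rows in file order. If the file is grouped in some way (for example by game), a shuffled split may give a more representative test set. The following is a minimal sketch of that idea using only the standard library; the function name shuffled_split and the seed are assumptions made for illustration.</p>

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    """Return (train_indices, test_indices) after a reproducible shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a seeded generator keeps the split repeatable
    cut = int(train_fraction * n_rows)
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

<p>The resulting index lists could then be passed to dataset.iloc[...] to build the two DataFrames.</p>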

In [None]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# corpus = train_df['Reviews'].tolist()
# vectorizer = TfidfVectorizer(analyzer = 'word',use_idf = True)
# X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names())
# print(X.shape)

<h3>CREATING DOCUMENT MATRIX</h3>
<p>To get a vocabulary of words with their frequencies we use sklearn's <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a>. It builds a vocabulary from the training reviews and represents each document as a vector of word counts. Together these vectors form the document-term matrix that is passed as input to our models.</p>
<p>Here max_features is set to 5000, which was chosen for better accuracy: only the 5000 most frequent words are kept and the less frequent words are ignored.</p>


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Create an instance of CountVectorizer
vectoriser = CountVectorizer(max_features=5000) # max features is set to 5000 for better accuracy
# Fit to the data and transform to feature matrix
X_train = vectoriser.fit_transform(train_df['Reviews'])
X_test = vectoriser.transform(test_df['Reviews'])
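
<p>To illustrate what a document-term matrix contains, here is a small pure-Python sketch of the same idea: count tokens, keep the most frequent ones, and emit one row of counts per document. It is a simplified stand-in, not sklearn's implementation; the helper names and the three toy documents are assumptions.</p>

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def build_vocabulary(docs, max_features=None):
    """Map the max_features most frequent terms to column indices."""
    totals = Counter()
    for doc in docs:
        totals.update(tokenize(doc))
    # Most frequent first; ties broken alphabetically so the result is deterministic
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [term for term, _ in ranked[:max_features]]
    return {term: col for col, term in enumerate(sorted(kept))}

def to_count_matrix(docs, vocabulary):
    """One row per document, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

docs = ["great game great fun", "boring game", "fun fun fun"]
vocab = build_vocabulary(docs, max_features=3)
matrix = to_count_matrix(docs, vocab)
print(vocab)   # {'fun': 0, 'game': 1, 'great': 2}
print(matrix)  # [[1, 1, 2], [0, 1, 0], [3, 0, 0]]
```

<p>The word "boring" is dropped because it is the least frequent term, which mirrors how limiting the number of features discards rare words. The real vectorizer returns a sparse matrix rather than nested lists.</p>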

<h4>This is the input of our model in matrix form.</h4>

In [None]:
print(X_train)

<h3> NAIVE BAYES MODEL IMPLEMENTATION</h3>
<p>This classifier relies on two probabilities: P(class), the probability that an input belongs to a certain class, and P(input_condition|class), the probability that an input feature has a certain value given the class. A feature value that was never seen with a class during training gets probability 0 unless smoothing is applied. Multinomial Naive Bayes implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word count vectors).</p>
<p>The <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html">multinomial naive Bayes model</a> provided by sklearn is used here. The model was first run with the default alpha = 1. After hyperparameter tuning, alpha was updated to the value that gave the best accuracy, 27.3%.</p>

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha = 1.0e-10)
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)  # predict ratings for the test dataset with the trained model
test_df['Predictions'] = Y_pred
print(test_df)
accuracy = clf.score(X_test, np.array(Y_test))
print("Accuracy of test data predictions: ",accuracy*100,"%")
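
<p>For readers who want to see the two probabilities at work, the following is a minimal from-scratch sketch of multinomial naive Bayes with additive smoothing, computed in log space. It is written for illustration under simplifying assumptions (tiny made-up corpus, token lists as input) and is not how sklearn implements the model.</p>

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}

    # log P(token | class) with additive smoothing controlled by alpha
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

train_docs = [["great", "fun", "great"], ["fun", "love"],
              ["boring", "bad"], ["bad", "slow", "boring"]]
train_labels = [9, 9, 3, 3]
log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)

print(predict(["great", "fun"], log_prior, log_likelihood))    # 9
print(predict(["boring", "slow"], log_prior, log_likelihood))  # 3
```

<p>With alpha set to exactly 0, any token never seen with a class would give a probability of 0 and math.log would fail, which is why a small positive alpha is needed.</p>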

<h3>PERFORMANCE EVALUATION </h3>
<p>Accuracy and mean squared error are used as performance measures for the algorithm. Since this is a classification problem with 10 different classes, the accuracy can be low. The mean squared error therefore gives us an idea of how close the predicted ratings are to the original ratings.</p>

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import numpy as np
Y_true = np.array(Y_test)
mse = mean_squared_error(Y_true, Y_pred)
print("Accuracy:",accuracy_score(Y_true, Y_pred)*100,"%")
print("Mean Squared error:",mse)
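
<p>The difference between the two measures can be seen on a toy example. In the sketch below, both sets of made-up predictions have 0% accuracy, but the mean squared error separates near misses from far misses. The helper names are illustrative and simply restate the standard definitions.</p>

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true rating exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Average squared distance between true and predicted ratings."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [8, 7, 6, 9]
near_misses = [7, 8, 7, 8]   # always off by one
far_misses = [1, 1, 10, 2]   # far from the true rating

print(exact_match_accuracy(y_true, near_misses), mse(y_true, near_misses))  # 0.0 1.0
print(exact_match_accuracy(y_true, far_misses), mse(y_true, far_misses))    # 0.0 37.5
```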

<h3> HYPERPARAMETER TUNING </h3>
<p>Hyperparameter tuning is done on the smoothing parameter alpha of Multinomial Naive Bayes, using sklearn's GridSearchCV with 5-fold cross-validation. The best cross-validated accuracy obtained for this model is 30.9%. The alpha value that achieves it is then used to train the final model.</p>
<p>The best results are obtained for alpha=0.0, but the recommended minimum alpha value is 1.0e-10, so that value is used for training the algorithm.</p>

In [None]:
from sklearn.model_selection import GridSearchCV
params = {'alpha': np.linspace(0, 1, 100)}  # 100 evenly spaced alpha values from 0 to 1
multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=3, cv=5, verbose=5,scoring='accuracy')
multinomial_nb_grid.fit(X_train, Y_train)
print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, Y_true))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)
results_NB = pd.DataFrame(multinomial_nb_grid.cv_results_['params'])
results_NB['test_score'] = multinomial_nb_grid.cv_results_['mean_test_score']
results_NB
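
<p>As a rough picture of what 5-fold cross-validation does, the sketch below generates train and validation index lists for plain contiguous folds. This is a simplification: for classifiers, GridSearchCV with an integer cv uses stratified folds that preserve class proportions, which this sketch does not attempt. The function name k_fold_indices is an assumption.</p>

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(12, n_splits=5))
for train, validation in folds:
    print(len(train), validation)
```

<p>Each candidate alpha is fitted once per fold on the training indices and scored on the validation indices; the mean of those scores is what appears as mean_test_score in the results table.</p>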

<h4>Hyperparameter tuning plot</h4>

In [None]:
#ind = params['alpha'].index(multinomial_nb_grid.best_params_['alpha'])
ind = np.where(params['alpha'] == multinomial_nb_grid.best_params_['alpha'])
print(ind)

In [None]:
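%matplotlib inline
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (15,7)

best_alpha = multinomial_nb_grid.best_params_['alpha']

fig, ax = plt.subplots(1) 
# label each artist so the legend has entries to show
ax.plot(results_NB['alpha'], results_NB['test_score'],'ro-', label='Mean CV accuracy')
# mark the best alpha by its value on the x-axis
ax.axvline(best_alpha, linestyle='--', label='Best alpha')
ax.set_title('Hyperparameter Tuning')
ax.set(xlabel='Alpha', ylabel='Accuracy')
ax.legend(loc="upper right")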

In [None]:
# from sklearn import svm
# clf = svm.SVC()
# clf.fit(X_train, train_df['Rating'])
# accuracy = clf.score(X_test, y)
# print(accuracy)

<h3>Rating estimation for a sample review</h3>

In [None]:
review = input("Enter a review:")
X_review = vectoriser.transform([review])  # separate name so the test matrix X_test is not overwritten
pred = clf.predict(X_review)
print("The estimated rating is: ", str(pred[0]))
print(clf.predict_proba(X_review))

<h3>Training with the complete dataset</h3>
<p>Now that we have the hyperparameter value that gives the best results, the entire dataset is used to train the model that estimates ratings in the application.</p>

In [None]:
X_train_final = vectoriser.fit_transform(dataset['Reviews'])

In [None]:
clf = MultinomialNB(alpha = 1.0e-10)
clf.fit(X_train_final, dataset['Rating'])

<h3>Exporting vectorizer and model to use in deployment server</h3>
<p>Pickle is used to save the model and the vectorizer so that they can be used externally. These files are used on PythonAnywhere by the deployed application.</p>

In [None]:
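import pickle
# Context managers make sure both files are closed after writing
with open('D:/UTA/Fall-2020/DM/TermProject/NaiveBayesClassifier', 'wb') as model_file:
    pickle.dump(clf, model_file)
with open('D:/UTA/Fall-2020/DM/TermProject/Vectorizer', 'wb') as vectorizer_file:
    pickle.dump(vectoriser, vectorizer_file)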
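
<p>On the deployment side the saved files have to be read back with pickle.load. The sketch below shows the save and load round trip on a small stand-in dictionary written to a temporary directory; the file name model.pkl and the dictionary contents are placeholders, not the notebook's actual artifacts.</p>

```python
import os
import pickle
import tempfile

# Stand-in for the fitted objects; a real run would pickle the classifier and vectorizer
artifacts = {"vocabulary": {"fun": 0, "game": 1}, "alpha": 1.0e-10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as f_out:
        pickle.dump(artifacts, f_out)
    with open(path, "rb") as f_in:
        restored = pickle.load(f_in)

print(restored == artifacts)  # True
```

<p>Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. It is also safest to load a pickled model with the same library versions that were used to save it.</p>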

<h3>Challenges faced</h3>

 - Because the dataset is large, model execution time was very high, so SVM could not be run on the entire dataset.
 - The dataset had missing values; rows with NaN reviews were removed, which reduced the dataset size.
 - The pre-processing step was removed after it was observed that accuracy improved by 1% without pre-processing such as lemmatization and stopword removal.

<h3>References</h3>

 - Sklearn documentation: [https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
 - [https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
 - [https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/)

<p>I did not use any other references for the model implementation. I followed the official documentation above and the process from assignment 3.</p>

<h3>Links</h3>

 - Blog post : https://pxm5568.uta.cloud/img/Maitreyee_02.html 
 - Working model is deployed at: [http://pragnyam.pythonanywhere.com/](http://pragnyam.pythonanywhere.com/)
 - GitHub link : [https://github.com/Pragnyashree/RatingEstimator](https://github.com/Pragnyashree/RatingEstimator)
  