Hello!

When looking at online reviews, is the body text or the title more helpful in determining the product's rating?

This kernel is meant to demonstrate a typical machine learning workflow as well as provide insight into online reviews. I have not had much experience with NLP and was excited to give this a try.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import os
print(os.listdir("../input"))

Import our usual suspects and look at the data. We should check the data dictionary to learn more about each column before starting EDA.

In [None]:
df = pd.read_csv('../input/GrammarandProductReviews.csv')
df.head(3)

In [None]:
df.info()

We will be focusing on how to predict user review scores from the data. Because of this, we will start our EDA by looking at the dependent variable on its own and doing some brief analysis of this target.

In [None]:
df['reviews.rating'].describe()

In [None]:
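# Ratings are discrete, so a count plot shows the distribution directly
# (sns.distplot is deprecated in newer seaborn releases)
sns.countplot(x = 'reviews.rating', data = df)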

Data Munging and needed Transformations

The first item is to look at what data we are missing. Since we are dealing with online reviews, there may be a lot of fields that people do not always fill in. We will create a table showing what percentage of values is missing from each column. From there we will decide what is acceptable and what needs to be altered.

In [None]:
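def missing_values(x):
    # Count and percentage of missing values per column
    Total = x.isnull().sum()
    Percent = (Total / (len(x)))*100
    Missing = pd.concat([Total,Percent], axis = 1)
    Missing = Missing.rename(columns = {0: "Total Missing Values", 1:"% Missing"})
    # Return the table; printing inside and then printing the None result added a stray "None"
    return Missing
missing_values(df)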

Because this is a starting analysis, we are going to keep it simple and drop all columns that have at least 1/3 of their data missing. In addition, we are going to drop all rows that are missing "reviews.text" or "reviews.title". Only a very small percentage of rows lack these fields, and they are the focus of this analysis.

In [None]:
textframe = df.drop(['reviews.userCity', 'reviews.userProvince', 'reviews.didPurchase', 'reviews.id','reviews.numHelpful', 'ean'], axis = 1)
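
The column list above is typed out by hand. As a rough sketch, the same 1/3 rule could be applied programmatically. The toy frame and its column names below are made up for illustration, and `threshold` is just a name I picked; on the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; these column names are hypothetical.
toy = pd.DataFrame({
    "text": ["good", "bad", "fine", "great", "poor", "ok"],
    "city": [np.nan, np.nan, "Austin", np.nan, np.nan, np.nan],  # 5 of 6 missing
    "title": ["a", np.nan, "c", "d", "e", "f"],                   # 1 of 6 missing
})

threshold = 1 / 3  # drop a column when at least this share of its values is missing

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share >= threshold].index.tolist()
trimmed = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(trimmed.columns.tolist())
```

This only covers the missing-data rule; columns dropped for other reasons would still need to be listed explicitly.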

In [None]:
textframe = textframe.dropna(subset=['reviews.text'])
textframe = textframe.dropna(subset=['reviews.title'])

In [None]:
# Re-use missing_values defined earlier to check the cleaned frame
missing_values(textframe)

We can see that we have eliminated some unnecessary information from our dataset and cleaned up the "text" and "title" columns. We have to be careful not to eliminate too much information when our dataset is small; however, with "only" 70,000 rows of data we should still have enough information.

In [None]:
textframe.info()

First, let's start by making a word cloud to see which words people use most often in their titles and reviews.

In [None]:
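from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    # Join every entry into one string; str(data) would only hand WordCloud
    # the truncated printout of the Series (first and last few rows).
    text = ' '.join(data.astype(str))
    wordcloud = WordCloud(stopwords = stopwords,
                          max_words = 300,
                          max_font_size = 45,
                          scale = 2).generate(text)

    fig = plt.figure(figsize = (15,15))
    plt.imshow(wordcloud)
    plt.axis('off')
    if title:
        plt.title(title)
    plt.show()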

In [None]:
show_wordcloud(textframe['reviews.text'])

In [None]:
show_wordcloud(textframe['reviews.title'])
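
Word clouds are nice to look at but hard to compare. As a complement, here is a small sketch that counts the most common words directly. The `titles` list and the tiny `stop` set are hypothetical stand-ins; on the real data you could pass `textframe['reviews.title']` and the `stopwords` set from the cell above.

```python
from collections import Counter
import re

# Hypothetical stand-in for textframe['reviews.title'].
titles = ["Great product", "Great value", "Not great", "Love this product"]

# Tiny illustrative stop-word list; the notebook's stopwords set could be used instead.
stop = {"this", "not", "the", "a", "it"}

def top_words(texts, stopwords, n=3):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", str(text).lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

top = top_words(titles, stop)
print(top)
```

Running this once for titles and once for review text gives two ranked lists that can be compared side by side.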

Intuition and a quick look at the word clouds suggest that the titles use many more descriptive words than the review text. This makes sense, as the title is supposed to draw people in to read the rest of the review.

Next, I wanted to add the length of the title and the length of the review text as features and take a quick look at them.

In [None]:
textframe["text_length"] = textframe['reviews.text'].apply(len)
textframe["title_length"] = textframe['reviews.title'].apply(len)

In [None]:
# Newer seaborn versions require x and y to be passed as keywords
sns.jointplot(x = 'text_length', y = 'title_length', data = textframe)

In [None]:
sns.set(font_scale = 1.0)
g = sns.FacetGrid(textframe, col = 'reviews.rating', height = 5)  # 'size' was renamed to 'height' in seaborn 0.9
g.map(plt.hist, 'text_length')

In [None]:
sns.set(font_scale = 1.0)
g = sns.FacetGrid(textframe, col = 'reviews.rating', height = 5)  # 'size' was renamed to 'height' in seaborn 0.9
g.map(plt.hist, 'title_length')
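
The histograms show the shape of the length distributions, but a small summary table can make the ratings easier to compare. Below is a minimal sketch on a made-up toy frame; it reuses the `reviews.rating` and `text_length` column names, and the review strings are invented for illustration.

```python
import pandas as pd

# Toy stand-in for textframe; the review strings are invented.
toy = pd.DataFrame({
    "reviews.rating": [5, 5, 1, 1, 3],
    "reviews.text": [
        "Love it",
        "Works great, would buy again",
        "Broke after a week of light use",
        "Bad",
        "Okay",
    ],
})

# Same idea as .apply(len), using the vectorized string accessor
toy["text_length"] = toy["reviews.text"].str.len()

# Number of reviews plus mean and median length for each rating
summary = toy.groupby("reviews.rating")["text_length"].agg(["count", "mean", "median"])
print(summary)
```

The same `groupby` call should work on `textframe` for both `text_length` and `title_length`.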

Let us use a word-level TF-IDF vectorizer (unigrams only here, since ngram_range = (1,1)) and see whether we get better accuracy with just the text, just the title, or a combination of the two.

In [None]:
from sklearn.model_selection import cross_val_score, train_test_split
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
word_vectorizer = TfidfVectorizer(
    min_df = 3,
    strip_accents = 'unicode',
    max_features = None,
    analyzer = 'word',
    token_pattern = r'\w{1,}',
    ngram_range = (1,1), 
    use_idf = True,
    smooth_idf = True,
    sublinear_tf = True,
    stop_words = 'english')

word_vectorizer.fit(textframe["reviews.text"])
train_word_features = word_vectorizer.transform(textframe["reviews.text"])
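
To make the effect of `ngram_range` concrete, here is a small sketch on three made-up phrases. It uses the default `TfidfVectorizer` settings rather than the exact options above, so treat it as an illustration of the idea only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented mini-reviews
docs = ["not good", "very good", "not bad"]

# Unigrams only, as in the cell above
unigram = TfidfVectorizer(ngram_range=(1, 1))
# Unigrams plus bigrams, so "not good" becomes its own feature
bigram = TfidfVectorizer(ngram_range=(1, 2))

unigram_matrix = unigram.fit_transform(docs)
bigram_matrix = bigram.fit_transform(docs)

print(sorted(unigram.vocabulary_))
print(sorted(bigram.vocabulary_))
print(unigram_matrix.shape, bigram_matrix.shape)
```

With unigrams alone, "not good" and "very good" share the feature "good" and differ only in the other word. Note also that scikit-learn's built-in English stop list contains "not", so `stop_words = 'english'` may remove negations that matter for ratings.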

In [None]:
title_vectorizer = TfidfVectorizer(
    min_df = 3,
    strip_accents = 'unicode',
    max_features = None,
    analyzer = 'word',
    token_pattern = r'\w{1,}',
    ngram_range = (1,1), 
    use_idf = True,
    smooth_idf = True,
    sublinear_tf = True,
    stop_words = 'english')
title_vectorizer.fit(textframe["reviews.title"])
# Transform the titles with the vectorizer that was fitted on titles
train_title_features = title_vectorizer.transform(textframe["reviews.title"])

In [None]:
train_features = hstack([train_title_features, train_word_features])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_features,textframe["reviews.rating"] ,test_size=0.2,random_state=101)
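
The split above uses the combined title and text features. To actually compare text only, title only, and the combination, one option is to score each feature set with the same model. The sketch below does this on an invented toy frame with `MultinomialNB` and 2-fold cross-validation; the data, the default vectorizer settings, and the choice of model are all assumptions for illustration. Like the cells above, it fits the vectorizers on all rows before splitting, which is a simplification.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy data standing in for textframe
toy = pd.DataFrame({
    "reviews.title": ["Great", "Terrible", "Love it", "Awful",
                      "Excellent", "Bad", "Fantastic", "Poor"],
    "reviews.text": [
        "Works great and smells nice",
        "Broke quickly and smells terrible",
        "Great value, works well",
        "Terrible quality, broke fast",
        "Nice quality and great value",
        "Bad value, broke quickly",
        "Works well, nice quality",
        "Bad quality and terrible value",
    ],
    "reviews.rating": [5, 1, 5, 1, 5, 1, 5, 1],
})

text_features = TfidfVectorizer().fit_transform(toy["reviews.text"])
title_features = TfidfVectorizer().fit_transform(toy["reviews.title"])

feature_sets = {
    "text only": text_features,
    "title only": title_features,
    "title + text": hstack([title_features, text_features]).tocsr(),
}

# Mean cross-validated accuracy for each feature set, same model each time
scores = {}
for name, features in feature_sets.items():
    cv_scores = cross_val_score(MultinomialNB(), features,
                                toy["reviews.rating"], cv=2, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

On the real data, the three entries would be `train_word_features`, `train_title_features`, and `train_features`, with `textframe["reviews.rating"]` as the target.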

First, let's try a Random Forest classifier to predict review ratings.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
classifier = RandomForestClassifier(n_estimators = 20,min_samples_leaf = 3 )
classifier.fit(X_train,y_train)

In [None]:
preds=classifier.predict(X_test)

In [None]:
print(accuracy_score(y_test,preds))
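
Accuracy alone can hide how a model does on each individual rating. `confusion_matrix` and `classification_report` were imported above but not used, so here is a minimal sketch of what they show. The `y_true` and `y_pred_toy` lists are invented; on the real data they would be `y_test` and `preds`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels: a model that predicts 5 stars almost every time
y_true = [5, 5, 5, 5, 4, 4, 1, 1]
y_pred_toy = [5, 5, 5, 5, 5, 5, 1, 5]

labels = [1, 4, 5]
acc = accuracy_score(y_true, y_pred_toy)
# Rows are true ratings, columns are predicted ratings, both in the order of `labels`
cm = confusion_matrix(y_true, y_pred_toy, labels=labels)

print(acc)
print(cm)
# zero_division=0 avoids warnings for ratings that were never predicted
print(classification_report(y_true, y_pred_toy, labels=labels, zero_division=0))
```

In this toy case the overall accuracy looks passable even though the 4-star reviews are never predicted correctly, which is the kind of pattern the per-class report makes visible.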

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
clf = MultinomialNB().fit(X_train, y_train)

In [None]:
predsclf = clf.predict(X_test)

In [None]:
print(accuracy_score(y_test,predsclf))

In [None]:
from sklearn.svm import SVC

In [None]:
model = SVC()

In [None]:
model.fit(X_train,y_train)

In [None]:
predsvm = model.predict(X_test)

In [None]:
print(accuracy_score(y_test,predsvm))

In [None]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)

In [None]:
grid.fit(X_train,y_train)
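
Once the search has finished, the best settings and their scores can be read from the fitted object. The sketch below shows this on a small synthetic dataset from `make_classification` with a reduced grid, so it runs quickly; the dataset, the grid values, and `cv=3` are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic dataset standing in for the TF-IDF features
X, y = make_classification(n_samples=120, n_features=10, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=101)

# Reduced grid so the example finishes in seconds
small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), small_grid, refit=True, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)   # parameter combination with the best mean CV score
print(search.best_score_)    # that mean cross-validated accuracy

# With refit=True the best model is retrained on all of X_tr, so it can score held-out data
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```

For the notebook's own search, the equivalent calls would be `grid.best_params_` and `grid.score(X_test, y_test)`. An RBF SVC over 25 parameter combinations on tens of thousands of reviews can take a long time, so starting with a smaller grid or a sample of the rows may be worthwhile.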