# Star Rating Predictor Based on Text Reviews

This project takes a scraped dataset of reviews for a single product and uses Logistic Regression and Naive Bayes to try to predict the star rating from the review text.
The preprocessing stage cleans the data and gets it ready for modelling. It involves importing, lowercasing, removing punctuation and numbers, tokenising, removing stop words and stemming.

In [174]:
import pandas as pd
import numpy as np
import nltk
import string
import re

# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score
# Preprocessing/analysis
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# Models
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Evaluation
from sklearn.metrics import accuracy_score

# NLTK resources
nltk.download("stopwords")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ellio\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ellio\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Importing Data

In [175]:
#Reading data
df = pd.read_csv("where-the-crawdads-sing-book.csv")
df.head()

Unnamed: 0,product,title,rating,text_review
0,Where the Crawdads Sing,I stopped reading,1.0,I managed to get halfway through this book. Th...
1,Where the Crawdads Sing,Stick With It,4.0,Gradually abandoned by her family and shunned ...
2,Where the Crawdads Sing,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...
3,Where the Crawdads Sing,The most amazing book,5.0,I was so looking forward to this book and I ha...
4,Where the Crawdads Sing,"Croak, Croak",3.0,When a female friend I trust recommended this ...


In [176]:
df.shape

(3462, 4)

In [177]:
#Checking how many reviews there are for each rating
df.rating.value_counts()

5.0    2980
4.0     270
3.0      92
1.0      66
2.0      54
Name: rating, dtype: int64

In [178]:
#Checking to see if there is any null data
df[["rating", "text_review"]].isnull().any()

rating         False
text_review    False
dtype: bool

In [179]:
#Checking all reviews match the product (ignoring stray whitespace) before dropping the column
(df["product"].str.strip() == "Where the Crawdads Sing").value_counts()


True    3462
Name: product, dtype: int64

In [180]:
df.drop(["product"], axis=1, inplace=True)
df.head()

Unnamed: 0,title,rating,text_review
0,I stopped reading,1.0,I managed to get halfway through this book. Th...
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...
3,The most amazing book,5.0,I was so looking forward to this book and I ha...
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...


# Pre-Processing

- Lowercasing
- Remove special characters
- Remove stop words
- Stem words
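
The steps listed above can also be sketched as a single function applied to one review at a time. This is a minimal illustration rather than the notebook's exact code: `preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and the stemmer is passed in as an optional callable (for example `PorterStemmer().stem`).

```python
import re

# Hypothetical mini stop-word set for illustration; the notebook uses NLTK's English list
DEMO_STOP_WORDS = {"i", "to", "this", "the", "a", "and", "through", "in"}

def preprocess(text, stop_words=DEMO_STOP_WORDS, stem=None):
    """Lowercase, strip punctuation and digits, tokenise, drop stop words, optionally stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    tokens = text.split()                # whitespace tokenisation, no empty tokens
    tokens = [t for t in tokens if t not in stop_words]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens

example = preprocess("I managed to get halfway through this book in 2 days!")
print(example)
```

Splitting on whitespace with `str.split()` avoids the empty tokens that `re.split` can return when a review starts or ends with whitespace.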

In [181]:
#Make column lower
df["lower"] = df["text_review"].str.lower()

In [182]:
df.head()

Unnamed: 0,title,rating,text_review,lower
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,i managed to get halfway through this book. th...
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,i was so looking forward to this book and i ha...
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,when a female friend i trust recommended this ...


In [183]:
#Load the English stop-word list (as a set for fast lookups)
stop_words = set(stopwords.words("english"))


In [184]:
#Remove punctuation (recent pandas versions need regex=True for pattern replacement)
df["no_punc"] = df["lower"].str.replace(r"[^\w\s]", "", regex=True)

In [185]:
#Removing numbers (recent pandas versions need regex=True for pattern replacement)
df["remove_numbers"] = df["no_punc"].str.replace(r"\d+", "", regex=True)

In [186]:
#Tokenise each review (raw string avoids an invalid escape sequence warning)
df["tokenise"] = [re.split(r"\W+", review) for review in df["remove_numbers"]]

In [187]:
df.head()

Unnamed: 0,title,rating,text_review,lower,no_punc,remove_numbers,tokenise
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,i managed to get halfway through this book. th...,i managed to get halfway through this book tho...,i managed to get halfway through this book tho...,"[i, managed, to, get, halfway, through, this, ..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,i was so looking forward to this book and i ha...,i was so looking forward to this book and i ha...,i was so looking forward to this book and i ha...,"[i, was, so, looking, forward, to, this, book,..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,when a female friend i trust recommended this ...,when a female friend i trust recommended this ...,when a female friend i trust recommended this ...,"[when, a, female, friend, i, trust, recommende..."


In [188]:
#Dropping extra columns that won't be needed
df.drop(["lower", "no_punc", "remove_numbers",], axis=1, inplace=True)
df.head()

Unnamed: 0,title,rating,text_review,tokenise
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[i, managed, to, get, halfway, through, this, ..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[i, was, so, looking, forward, to, this, book,..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[when, a, female, friend, i, trust, recommende..."


In [189]:
#Function to remove stopwords from tokenised column
def remove_stopwords(tokenised_text):
    cleaned_text = [word for word in tokenised_text if word not in stop_words]
    return cleaned_text

In [190]:
df["no_stopwords"] = df["tokenise"].apply(remove_stopwords)

In [191]:
#Checking to see stopwords have been removed
df["no_stopwords"].head()

0    [managed, get, halfway, book, thought, somewha...
1    [gradually, abandoned, family, shunned, locals...
2    [blurb, book, suggest, interesting, crime, thr...
3    [looking, forward, book, say, didnt, disappoin...
4    [female, friend, trust, recommended, book, dow...
Name: no_stopwords, dtype: object

In [192]:
df.head()

Unnamed: 0,title,rating,text_review,tokenise,no_stopwords
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[i, managed, to, get, halfway, through, this, ...","[managed, get, halfway, book, thought, somewha..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s...","[gradually, abandoned, family, shunned, locals..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte...","[blurb, book, suggest, interesting, crime, thr..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[i, was, so, looking, forward, to, this, book,...","[looking, forward, book, say, didnt, disappoin..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[when, a, female, friend, i, trust, recommende...","[female, friend, trust, recommended, book, dow..."


In [193]:
#Dropping tokenised column 
df.drop(["tokenise"], axis=1, inplace=True)

In [194]:
#Stemming the words
ps = PorterStemmer()

In [195]:
def stem(text):
    stem_text = [ps.stem(word) for word in text]
    return stem_text

In [196]:
df["stem_review"] = df["no_stopwords"].apply(stem)

In [197]:
df["stem_review"].head()

0    [manag, get, halfway, book, thought, somewhat,...
1    [gradual, abandon, famili, shun, local, barkle...
2    [blurb, book, suggest, interest, crime, thrill...
3    [look, forward, book, say, didnt, disappoint, ...
4    [femal, friend, trust, recommend, book, downlo...
Name: stem_review, dtype: object

In [198]:
df.head()

Unnamed: 0,title,rating,text_review,no_stopwords,stem_review
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[managed, get, halfway, book, thought, somewha...","[manag, get, halfway, book, thought, somewhat,..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, family, shunned, locals...","[gradual, abandon, famili, shun, local, barkle..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[blurb, book, suggest, interesting, crime, thr...","[blurb, book, suggest, interest, crime, thrill..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[looking, forward, book, say, didnt, disappoin...","[look, forward, book, say, didnt, disappoint, ..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[female, friend, trust, recommended, book, dow...","[femal, friend, trust, recommend, book, downlo..."


In [199]:
#Dropping the title column; the rating and review columns are kept
df.drop(["title"], axis=1, inplace=True)

In [200]:
df.head()

Unnamed: 0,rating,text_review,no_stopwords,stem_review
0,1.0,I managed to get halfway through this book. Th...,"[managed, get, halfway, book, thought, somewha...","[manag, get, halfway, book, thought, somewhat,..."
1,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, family, shunned, locals...","[gradual, abandon, famili, shun, local, barkle..."
2,1.0,The blurb of this book suggest an interesting ...,"[blurb, book, suggest, interesting, crime, thr...","[blurb, book, suggest, interest, crime, thrill..."
3,5.0,I was so looking forward to this book and I ha...,"[looking, forward, book, say, didnt, disappoin...","[look, forward, book, say, didnt, disappoint, ..."
4,3.0,When a female friend I trust recommended this ...,"[female, friend, trust, recommended, book, dow...","[femal, friend, trust, recommend, book, downlo..."


# Splitting Dataset and Modelling

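I will be using a TF-IDF vectoriser before splitting the dataset. Note that fitting the vectoriser on the full dataset lets the test reviews influence the vocabulary and IDF weights.
I will start with Logistic Regression and then, if needed, explore some other models.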

In [201]:
#Making list into string so I can vectorise
df["test"] = df["stem_review"].apply(" ".join)

In [202]:
#Using tfidf vect
tfidf_vect = TfidfVectorizer(max_features=5000)

In [203]:
#Splitting X and Y data
X = df["test"]
y = df["rating"]

In [204]:
#Fitting the TF-IDF vectoriser and transforming the text
X = tfidf_vect.fit_transform(X)

In [205]:
#Checking data type
X

<3462x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 64016 stored elements in Compressed Sparse Row format>
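
To make the shape of this matrix concrete, here is a small sketch on two made-up documents (not the review data). Each row is a document, each column a vocabulary term, and with the default settings each row is scaled to unit L2 norm. The names `docs`, `X_demo` and `vocab` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good book", "bad book"]  # toy documents, assumed for illustration

vect = TfidfVectorizer()
X_demo = vect.fit_transform(docs)          # sparse matrix: documents x terms

vocab = sorted(vect.vocabulary_)           # terms in column order
dense = X_demo.toarray()
row_norms = np.linalg.norm(dense, axis=1)  # default norm="l2" gives unit-length rows

print(vocab)
print(dense.round(3))
```

The shared word `book` gets a lower weight than `good` or `bad` because it appears in both documents.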

In [206]:
#Splitting dataset into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [207]:
X_train.shape

(2769, 5000)

In [208]:
X_test.shape

(693, 5000)

# Logistic Regression

In [209]:
#Implementing Logistic Regression Model
log_model = LogisticRegression()

In [210]:
log_model = log_model.fit(X=X_train, y=y_train)

In [211]:
y_pred = log_model.predict(X_train)

In [212]:
series_y_pred = pd.Series(y_pred)

In [213]:
series_y_pred.value_counts()

5.0    2743
4.0      23
3.0       3
dtype: int64

In [214]:
accuracy_score(y_train, y_pred)

0.8699891657638137

The predictions are far from the actual distribution: the model predicts 5 stars for 2,743 of the 2,769 training reviews and never predicts 1 or 2 stars. The 86.99% accuracy is measured on the training data and mainly reflects how heavily the dataset is weighted towards 5-star ratings. I will try Naive Bayes next, then run the models again after balancing the dataset.
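
One way to see the effect of the imbalance is to compare plain accuracy with balanced accuracy, which averages recall over the classes. The sketch below uses made-up labels (not the notebook's predictions) for a classifier that always answers 5 stars; `balanced_accuracy_score` and `confusion_matrix` are scikit-learn functions not otherwise used in this notebook.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Toy labels, assumed for illustration: eight 5-star reviews, one 1-star, one 3-star
y_true = [5, 5, 5, 5, 5, 5, 5, 5, 1, 3]
y_always_five = [5] * len(y_true)

acc = accuracy_score(y_true, y_always_five)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_always_five)  # mean per-class recall
cm = confusion_matrix(y_true, y_always_five, labels=[1, 3, 5])  # rows: true, columns: predicted

print(acc, bal_acc)
print(cm)
```

Plain accuracy looks respectable here while balanced accuracy shows that two of the three classes are never recovered.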

In [215]:
y_train.value_counts()

5.0    2387
4.0     218
3.0      75
1.0      48
2.0      41
Name: rating, dtype: int64
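
For comparison, the counts above give the accuracy of a baseline that always predicts 5 stars on the training split. This is simple arithmetic on the printed values; `train_counts` and `baseline_accuracy` are names introduced here.

```python
# Training-set rating counts, copied from the y_train.value_counts() output above
train_counts = {5.0: 2387, 4.0: 218, 3.0: 75, 1.0: 48, 2.0: 41}

total = sum(train_counts.values())
majority_class = max(train_counts, key=train_counts.get)
baseline_accuracy = train_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))  # 5.0 0.862
```

The logistic regression's 86.99% training accuracy is therefore less than one percentage point above this baseline.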

In [216]:
log_model.intercept_

array([-0.71510199, -1.20493741, -0.76688573,  0.1711766 ,  2.51574852])

In [217]:
log_model.coef_

array([[-0.00351034, -0.00259647, -0.00457189, ..., -0.01143033,
        -0.00146059, -0.00146059],
       [-0.00302122, -0.00294864, -0.00430001, ..., -0.00896073,
        -0.00164358, -0.00164358],
       [-0.00562806, -0.00638401, -0.00724745, ..., -0.01552109,
        -0.00303596, -0.00303596],
       [-0.01598927, -0.01108417, -0.01630425, ..., -0.06025408,
        -0.02115917, -0.02115917],
       [ 0.02814889,  0.0230133 ,  0.0324236 , ...,  0.09616622,
         0.0272993 ,  0.0272993 ]])

In [218]:
#Checking mean rating of the y train data
y_train.mean()

4.7533405561574575

In [219]:
#Comparing it to model
series_y_pred.mean()

4.989526905019863

# Bayes Model

In [220]:
#GaussianNB does not accept sparse input, so X_train is converted with toarray
classifier = GaussianNB()
classifier.fit(X_train.toarray(), y_train)

GaussianNB()

In [221]:
y_pred_NB = classifier.predict(X_train.toarray())

In [222]:
series_y_pred_NB = pd.Series(y_pred_NB)

In [223]:
series_y_pred_NB.value_counts()

5.0    1334
4.0     667
3.0     311
1.0     282
2.0     175
dtype: int64

In [224]:
accuracy_score(y_train, y_pred_NB)

0.6016612495485735

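Training accuracy is 60.17%, well below logistic regression's 86.99%. Unlike logistic regression, Naive Bayes spreads its predictions over all five ratings, but with the dataset weighted so heavily towards 5-star ratings, accuracy alone is still a poor guide to how well the model works.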
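
As an aside, scikit-learn's `MultinomialNB` accepts sparse input directly, so it avoids the `toarray()` conversion that `GaussianNB` needs. The sketch below is a possible variant on made-up texts, not a step the notebook takes, and its behaviour here says nothing about how it would do on the review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy texts and ratings, assumed for illustration
demo_texts = ["great love book", "love great read", "boring dull book", "dull slow boring"]
demo_labels = [5, 5, 1, 1]

demo_vect = TfidfVectorizer()
X_sparse = demo_vect.fit_transform(demo_texts)  # stays sparse, no toarray() call

mnb = MultinomialNB()
mnb.fit(X_sparse, demo_labels)
demo_pred = mnb.predict(X_sparse)
print(demo_pred)
```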

# Over/Undersampling

Both models' predictions are off. This could be because of the imbalanced dataset. I'll try to balance the training data by resampling, here by oversampling the 1 to 3 star reviews.

In [225]:
#Seeing how many ratings there are for each star
df["rating"].value_counts()

5.0    2980
4.0     270
3.0      92
1.0      66
2.0      54
Name: rating, dtype: int64

The data is split into train and test sets before separating X and y, so that only the training data is resampled.
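
The cells below oversample by repeating each minority class a fixed number of times. A more general sketch, assuming a hypothetical helper `oversample_to` that draws rows with replacement until each class reaches a chosen size, could look like this; the demo frame is made up.

```python
import pandas as pd

def oversample_to(train_df, label_col, target, random_state=0):
    """Upsample every class with fewer than `target` rows to exactly `target` rows."""
    parts = []
    for _, group in train_df.groupby(label_col):
        parts.append(group)
        shortfall = target - len(group)
        if shortfall > 0:
            # Draw the missing rows with replacement from the same class
            parts.append(group.sample(n=shortfall, replace=True, random_state=random_state))
    return pd.concat(parts, ignore_index=True)

# Made-up training frame using the notebook's column names
demo = pd.DataFrame({
    "rating": [5, 5, 5, 5, 5, 5, 1, 1, 3],
    "test": ["great"] * 6 + ["bore", "dull"] + ["fine"],
})
balanced = oversample_to(demo, "rating", target=4)
print(balanced["rating"].value_counts().to_dict())
```

As in the notebook, only the training split should be passed in, so that duplicated reviews never end up in the test set.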

In [226]:
train_df, test_df = train_test_split(df, test_size = 0.2)

In [227]:
test_df["rating"].value_counts()

5.0    594
4.0     48
3.0     28
2.0     13
1.0     10
Name: rating, dtype: int64

In [228]:
train_df["rating"].value_counts()

5.0    2386
4.0     222
3.0      64
1.0      56
2.0      41
Name: rating, dtype: int64

In [229]:
#Oversampling ratings 1-3
rating_3 = pd.concat([train_df[train_df["rating"] == 3.0]]*3, ignore_index = True)
rating_2 = pd.concat([train_df[train_df["rating"] == 2.0]]*4, ignore_index = True)
rating_1 = pd.concat([train_df[train_df["rating"] == 1.0]]*3, ignore_index = True)



In [230]:
train_df = pd.concat([train_df, rating_3, rating_2, rating_1])

In [231]:
train_df["rating"].value_counts()

5.0    2386
3.0     256
1.0     224
4.0     222
2.0     205
Name: rating, dtype: int64

In [233]:
X_resample = train_df["test"]
y_resample = train_df["rating"]

In [236]:
#Refitting the TF-IDF vectoriser on the resampled training text
X_resample = tfidf_vect.fit_transform(X_resample)

In [237]:
#Logistic regression on new balanced samples
new_log_model = LogisticRegression()

In [254]:
resample_lr_model = new_log_model.fit(X=X_resample, y=y_resample)

In [255]:
y_resample_pred = resample_lr_model.predict(X_resample)

In [256]:
#Checking overall accuracy for resample train data, will now look at the test data
accuracy_score(y_resample, y_resample_pred)

0.9283328272092317

In [249]:
X_resample_test = test_df["test"]
y_resample_test = test_df["rating"]

In [250]:
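#Note: fit_transform refits the vectoriser on the test text, so these columns do not line up with those of X_resample
X_resample_test = tfidf_vect.fit_transform(X_resample_test)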

In [257]:
#Note: this refits the model on the test data, so the score below is not a held-out test score
resample_lr_test = new_log_model.fit(X=X_resample_test, y= y_resample_test)

In [258]:
resample_test_y_pred = resample_lr_test.predict(X_resample_test)

In [259]:
accuracy_score(y_resample_test, resample_test_y_pred)

0.8585858585858586

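On the resampled training data, accuracy rises to 92.83% (from 86.99% before oversampling). The 85.86% reported for the test split is not a held-out score, though: both the vectoriser and the model were refit on the test reviews before predicting them. I'll now look at Naive Bayes using the same oversampled dataset.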
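
A held-out evaluation would reuse the vectoriser and model fitted on the training data, calling `transform` and `predict` on the test reviews without refitting. The following is a minimal sketch on made-up texts, not a rerun of the notebook, so it does not say what the score on the review data would be.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy train and test data, assumed for illustration
train_texts = ["love great book", "great read love", "boring slow book", "slow dull boring"]
train_labels = [5, 5, 1, 1]
test_texts = ["wonderful great story", "dull plot"]
test_labels = [5, 1]

vect = TfidfVectorizer()
X_tr = vect.fit_transform(train_texts)  # vocabulary and IDF learned from training text only
X_te = vect.transform(test_texts)       # same columns; unseen words are ignored

model = LogisticRegression()
model.fit(X_tr, train_labels)           # fitted once, on training data
test_pred = model.predict(X_te)         # no refit on the test data

held_out_accuracy = accuracy_score(test_labels, test_pred)
print(held_out_accuracy)
```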

In [260]:
#Bayes with balanced dataset
new_classifier = GaussianNB()
new_classifier.fit(X_resample.toarray(), y_resample)

GaussianNB()

In [261]:
y_new_pred_NB = new_classifier.predict(X_resample.toarray())

In [262]:
accuracy_score(y_resample, y_new_pred_NB)

0.6662617673853629

In [264]:
#Note: this refits NB on the test data rather than scoring the model trained on the resampled data
test_nb_classifier = new_classifier.fit(X_resample_test.toarray(), y_resample_test)

In [265]:
y_test_NB = test_nb_classifier.predict(X_resample_test.toarray())

In [266]:
accuracy_score(y_resample_test, y_test_NB)

0.8513708513708513

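Naive Bayes reaches 66.63% on the resampled training data (up from 60.17%) and 85.14% on the test split. As with logistic regression, the test figure comes from a model refit on the test reviews, so it does not measure performance on unseen data.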

# Cross-Validation

In [267]:
log_model = LogisticRegression()

In [268]:
train_scores = cross_val_score(log_model, X, y, scoring= "accuracy", cv = 10)

In [269]:
train_scores

array([0.86167147, 0.85878963, 0.86416185, 0.86127168, 0.86127168,
       0.86127168, 0.86127168, 0.86416185, 0.86127168, 0.86416185])

In [270]:
train_scores.mean()

0.8619305025736702

Logistic Regression averaged 86.19% accuracy over 10-fold cross-validation. I will now try NB.

In [271]:
nb_model = GaussianNB()

In [272]:
train_nb_scores = cross_val_score(nb_model, X.toarray(), y, scoring = "accuracy", cv = 10)

In [273]:
train_nb_scores

array([0.76945245, 0.5648415 , 0.39884393, 0.33815029, 0.32080925,
       0.34393064, 0.27745665, 0.20520231, 0.33236994, 0.38150289])

In [274]:
train_nb_scores.mean()

0.39325598440805587

NB performed poorly over the cross-validation folds, averaging 39.33% with fold scores ranging from 20.52% to 76.95%. I will try a random forest classifier.

In [279]:
rf_model = RandomForestClassifier()

In [280]:
train_rf_scores = cross_val_score(rf_model, X.toarray(), y, scoring = "accuracy", cv = 10)

In [281]:
train_rf_scores

array([0.86167147, 0.85878963, 0.86127168, 0.86127168, 0.85549133,
       0.86127168, 0.86127168, 0.86127168, 0.85549133, 0.86127168])

In [282]:
train_rf_scores.mean()

0.8599073811863869

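Random Forest performed similarly to Logistic Regression, averaging 85.99% against 86.19%. Both are close to the 86.08% of reviews that are 5-star (2,980 of 3,462), so neither clearly beats always predicting 5 stars.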
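
A possible refinement, sketched here on made-up texts rather than the review data, is to put the vectoriser and the classifier in a scikit-learn `Pipeline` so that TF-IDF is refit on the training folds only, and to score with balanced accuracy on stratified folds. The texts, labels and fold count are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy texts and ratings, assumed for illustration
texts = [
    "love great book", "great read love", "wonderful story love",
    "great wonderful read", "love story great", "wonderful book read",
    "boring slow book", "slow dull plot", "dull boring story",
    "boring plot slow", "dull slow read", "boring dull book",
]
labels = [5] * 6 + [1] * 6

# The vectoriser sits inside the pipeline, so each fold fits it on its own training part
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, scoring="balanced_accuracy", cv=folds)
print(scores.mean())
```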