# Star Rating Predictor Based on Text Reviews

This project takes a scraped dataset of reviews for a single product and uses Logistic Regression and Naive Bayes to try to predict the star rating from the review text.
The preprocessing stage cleans the data and gets it ready for modelling. It involves importing, lowercasing, removing punctuation, tokenising, removing stop words and stemming.

In [129]:
import pandas as pd
import numpy as np
import nltk
import string 
import re

# Data splitting
from sklearn.model_selection import train_test_split
# Preprocessing/analysis
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import GaussianNB

from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import accuracy_score
nltk.download("stopwords")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ellio\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ellio\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Importing Data

In [2]:
#Reading data
df = pd.read_csv("where-the-crawdads-sing-book.csv")
df.head()

Unnamed: 0,product,title,rating,text_review
0,Where the Crawdads Sing,I stopped reading,1.0,I managed to get halfway through this book. Th...
1,Where the Crawdads Sing,Stick With It,4.0,Gradually abandoned by her family and shunned ...
2,Where the Crawdads Sing,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...
3,Where the Crawdads Sing,The most amazing book,5.0,I was so looking forward to this book and I ha...
4,Where the Crawdads Sing,"Croak, Croak",3.0,When a female friend I trust recommended this ...


In [3]:
df.shape

(3462, 4)

In [4]:
#Checking how many reviews there are for each rating
df.rating.value_counts()

5.0    2980
4.0     270
3.0      92
1.0      66
2.0      54
Name: rating, dtype: int64

In [5]:
#Checking to see if there is any null data
df[["rating", "text_review"]].isnull().any()

rating         False
text_review    False
dtype: bool

In [6]:
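#Checking all reviews match product before dropping col
#Note: the comparison below uses != against a string with trailing whitespace, so True only shows that no row equals that exact string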
(df["product"] != "Where the Crawdads Sing 	").value_counts()


True    3462
Name: product, dtype: int64

In [7]:
df.drop(["product"], axis=1, inplace=True)
df.head()

Unnamed: 0,title,rating,text_review
0,I stopped reading,1.0,I managed to get halfway through this book. Th...
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...
3,The most amazing book,5.0,I was so looking forward to this book and I ha...
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...


# Pre-Processing

- Lowercasing
- Remove special characters
- Remove stop words
- Stem words
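
The steps listed above can be sketched as a single function using only the standard library. The stop-word set `TOY_STOP_WORDS` and the function name `clean_review` are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stop-word list and `PorterStemmer`, and stemming is left out of this sketch.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses nltk's stopwords.words("english").
TOY_STOP_WORDS = {"i", "to", "this", "the", "a", "it", "was"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation and digits, tokenise, drop stop words."""
    lowered = text.lower()
    no_punc = re.sub(r"[^\w\s]", "", lowered)   # keep word characters and whitespace
    no_numbers = re.sub(r"\d+", "", no_punc)    # remove runs of digits
    tokens = no_numbers.split()                 # whitespace split, never yields empty tokens
    return [tok for tok in tokens if tok not in stop_words]

print(clean_review("I managed to get halfway through this book in 3 days!"))
```

Unlike `re.split(r"\W+", ...)`, the whitespace split here cannot produce empty strings at the start or end of a review.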

In [8]:
#Make column lower
df["lower"] = df["text_review"].str.lower()

In [9]:
df.head()

Unnamed: 0,title,rating,text_review,lower
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,i managed to get halfway through this book. th...
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,i was so looking forward to this book and i ha...
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,when a female friend i trust recommended this ...


In [10]:
#Make variable and check stopwords
stop_words = stopwords.words("english")


In [11]:
#Remove punc
df["no_punc"] = df["lower"].str.replace(r"[^\w\s]", "", regex=True)

In [12]:
#Removing numbers
df["remove_numbers"] = df["no_punc"].str.replace(r"\d+", "", regex=True)

In [13]:
#Tokenise
df["tokenise"] = [re.split(r"\W+", review) for review in df["remove_numbers"]]

In [14]:
df.head()

Unnamed: 0,title,rating,text_review,lower,no_punc,remove_numbers,tokenise
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,i managed to get halfway through this book. th...,i managed to get halfway through this book tho...,i managed to get halfway through this book tho...,"[i, managed, to, get, halfway, through, this, ..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,the blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,i was so looking forward to this book and i ha...,i was so looking forward to this book and i ha...,i was so looking forward to this book and i ha...,"[i, was, so, looking, forward, to, this, book,..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,when a female friend i trust recommended this ...,when a female friend i trust recommended this ...,when a female friend i trust recommended this ...,"[when, a, female, friend, i, trust, recommende..."


In [15]:
#Dropping extra columns that won't be needed
df.drop(["lower", "no_punc", "remove_numbers",], axis=1, inplace=True)
df.head()

Unnamed: 0,title,rating,text_review,tokenise
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[i, managed, to, get, halfway, through, this, ..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[i, was, so, looking, forward, to, this, book,..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[when, a, female, friend, i, trust, recommende..."


In [16]:
#Function to remove stopwords from tokenised column
def remove_stopwords(tokenised_text):
    cleaned_text = [word for word in tokenised_text if word not in stop_words]
    return cleaned_text

In [17]:
df["no_stopwords"] = df["tokenise"].apply(lambda x: remove_stopwords(x))

In [18]:
#Checking to see stopwords have been removed
df["no_stopwords"].head()

0    [managed, get, halfway, book, thought, somewha...
1    [gradually, abandoned, family, shunned, locals...
2    [blurb, book, suggest, interesting, crime, thr...
3    [looking, forward, book, say, didnt, disappoin...
4    [female, friend, trust, recommended, book, dow...
Name: no_stopwords, dtype: object

In [19]:
df.head()

Unnamed: 0,title,rating,text_review,tokenise,no_stopwords
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[i, managed, to, get, halfway, through, this, ...","[managed, get, halfway, book, thought, somewha..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, by, her, family, and, s...","[gradually, abandoned, family, shunned, locals..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[the, blurb, of, this, book, suggest, an, inte...","[blurb, book, suggest, interesting, crime, thr..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[i, was, so, looking, forward, to, this, book,...","[looking, forward, book, say, didnt, disappoin..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[when, a, female, friend, i, trust, recommende...","[female, friend, trust, recommended, book, dow..."


In [20]:
#Dropping tokenised column 
df.drop(["tokenise"], axis=1, inplace=True)

In [21]:
#Stemming the words
ps = PorterStemmer()

In [22]:
def stem(text):
    stem_text = [ps.stem(word) for word in text]
    return stem_text

In [23]:
df["stem_review"] = df["no_stopwords"].apply(lambda x: stem(x))

In [24]:
df["stem_review"].head()

0    [manag, get, halfway, book, thought, somewhat,...
1    [gradual, abandon, famili, shun, local, barkle...
2    [blurb, book, suggest, interest, crime, thrill...
3    [look, forward, book, say, didnt, disappoint, ...
4    [femal, friend, trust, recommend, book, downlo...
Name: stem_review, dtype: object

In [25]:
df.head()

Unnamed: 0,title,rating,text_review,no_stopwords,stem_review
0,I stopped reading,1.0,I managed to get halfway through this book. Th...,"[managed, get, halfway, book, thought, somewha...","[manag, get, halfway, book, thought, somewhat,..."
1,Stick With It,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, family, shunned, locals...","[gradual, abandon, famili, shun, local, barkle..."
2,Boring about a girl walking through mud.,1.0,The blurb of this book suggest an interesting ...,"[blurb, book, suggest, interesting, crime, thr...","[blurb, book, suggest, interest, crime, thrill..."
3,The most amazing book,5.0,I was so looking forward to this book and I ha...,"[looking, forward, book, say, didnt, disappoin...","[look, forward, book, say, didnt, disappoint, ..."
4,"Croak, Croak",3.0,When a female friend I trust recommended this ...,"[female, friend, trust, recommended, book, dow...","[femal, friend, trust, recommend, book, downlo..."


In [26]:
#Dropping the title column, which is not used for modelling
df.drop(["title"], axis=1, inplace=True)

In [27]:
df.head()

Unnamed: 0,rating,text_review,no_stopwords,stem_review
0,1.0,I managed to get halfway through this book. Th...,"[managed, get, halfway, book, thought, somewha...","[manag, get, halfway, book, thought, somewhat,..."
1,4.0,Gradually abandoned by her family and shunned ...,"[gradually, abandoned, family, shunned, locals...","[gradual, abandon, famili, shun, local, barkle..."
2,1.0,The blurb of this book suggest an interesting ...,"[blurb, book, suggest, interesting, crime, thr...","[blurb, book, suggest, interest, crime, thrill..."
3,5.0,I was so looking forward to this book and I ha...,"[looking, forward, book, say, didnt, disappoin...","[look, forward, book, say, didnt, disappoint, ..."
4,3.0,When a female friend I trust recommended this ...,"[female, friend, trust, recommended, book, dow...","[femal, friend, trust, recommend, book, downlo..."


# Splitting Dataset and Modelling

I will be using a TF-IDF vectoriser before splitting the dataset.
I will start with Logistic Regression and then, if needed, explore some other models.
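
One thing to be aware of with this order: fitting the vectoriser before the split lets the TF-IDF vocabulary and document frequencies see the test reviews. A minimal sketch of an alternative, assuming a scikit-learn pipeline (not used in this notebook) and a made-up toy corpus, fits the vectoriser on the training split only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy reviews and ratings, for illustration only.
texts = ["love book", "great stori", "beauti written", "realli enjoy",
         "bore book", "stop read", "wast time", "dull stori"]
ratings = [5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]

# stratify keeps the class proportions the same in both splits.
text_train, text_test, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.25, random_state=0, stratify=ratings)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
pipe.fit(text_train, y_train)

# At prediction time the already-fitted vocabulary is reused (transform, not fit).
test_pred = pipe.predict(text_test)
print(len(text_train), len(text_test), test_pred.shape)
```

The `stratify` argument is also an assumption here; the notebook's own split does not use it.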

In [28]:
#Making list into string so I can vectorise
df["test"] = df["stem_review"].apply(" ".join)

In [29]:
#Using tfidf vect
tfidf_vect = TfidfVectorizer(max_features=5000)

In [30]:
#Splitting X and Y data
X = df["test"]
y = df["rating"]

In [31]:
#Tfidf vect variable
X = tfidf_vect.fit_transform(X)

In [32]:
#Checking data type
X

<3462x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 64016 stored elements in Compressed Sparse Row format>

In [33]:
#Splitting dataset into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [34]:
X_train.shape

(2769, 5000)

In [35]:
X_test.shape

(693, 5000)

# Logistic Regression

In [36]:
#Implementing Logistic Regression Model
log_model = LogisticRegression()

In [37]:
log_model = log_model.fit(X=X_train, y=y_train)

In [38]:
y_pred = log_model.predict(X_train)

In [39]:
series_y_pred = pd.Series(y_pred)

In [40]:
series_y_pred.value_counts()

5.0    2743
4.0      23
3.0       3
dtype: int64

In [171]:
accuracy_score(y_train, y_pred)

0.8699891657638137

The model is far off from the actual values and is mostly predicting 5 star ratings. It reports 86.99% accuracy on the training data, but that is because the dataset is heavily weighted towards 5 star ratings. I will try Bayes next, then run the models again after balancing the dataset.
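
To put the 86.99% in context, a baseline that always predicts 5 stars can be scored from the training label counts alone. The sketch below uses the `y_train.value_counts()` figures shown in this notebook; the variable names are illustrative.

```python
# Training label counts copied from y_train.value_counts() in this notebook.
train_counts = {5.0: 2387, 4.0: 218, 3.0: 75, 1.0: 48, 2.0: 41}

n_train = sum(train_counts.values())
majority_class = max(train_counts, key=train_counts.get)

# Accuracy of a model that ignores the text and always predicts the majority class.
baseline_accuracy = train_counts[majority_class] / n_train

print(majority_class, n_train, round(baseline_accuracy, 4))
```

The baseline comes out at about 86.2%, so the logistic regression score is less than one percentage point above always guessing 5 stars.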

In [41]:
y_train.value_counts()

5.0    2387
4.0     218
3.0      75
1.0      48
2.0      41
Name: rating, dtype: int64

In [42]:
log_model.intercept_

array([-0.71510199, -1.20493741, -0.76688573,  0.1711766 ,  2.51574852])

In [43]:
log_model.coef_

array([[-0.00351034, -0.00259647, -0.00457189, ..., -0.01143033,
        -0.00146059, -0.00146059],
       [-0.00302122, -0.00294864, -0.00430001, ..., -0.00896073,
        -0.00164358, -0.00164358],
       [-0.00562806, -0.00638401, -0.00724745, ..., -0.01552109,
        -0.00303596, -0.00303596],
       [-0.01598927, -0.01108417, -0.01630425, ..., -0.06025408,
        -0.02115917, -0.02115917],
       [ 0.02814889,  0.0230133 ,  0.0324236 , ...,  0.09616622,
         0.0272993 ,  0.0272993 ]])

In [44]:
#Checking mean rating of y train data
y_train.mean()

4.7533405561574575

In [45]:
#Comparing it to model
series_y_pred.mean()

4.989526905019863

# Bayes Model

In [46]:
#Had to use toarray because X_train was a sparse matrix
classifier = GaussianNB()
classifier.fit(X_train.toarray(), y_train)

GaussianNB()
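
As an aside, the dense conversion is needed because `GaussianNB` does not accept sparse input. An alternative that this notebook does not use is `MultinomialNB`, which accepts the sparse TF-IDF matrix directly. A minimal sketch on a made-up toy corpus:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy reviews and ratings, for illustration only.
toy_texts = ["love book", "great stori", "bore book", "stop read"]
toy_ratings = [5.0, 5.0, 1.0, 1.0]

toy_X = TfidfVectorizer().fit_transform(toy_texts)  # sparse CSR matrix

# MultinomialNB works on the sparse matrix, so no .toarray() call is needed.
mnb = MultinomialNB()
mnb.fit(toy_X, toy_ratings)
toy_pred = mnb.predict(toy_X)

print(issparse(toy_X), toy_pred.shape)
```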

In [47]:
y_pred_NB = classifier.predict(X_train.toarray())

In [48]:
series_y_pred_NB = pd.Series(y_pred_NB)

In [49]:
series_y_pred_NB.value_counts()

5.0    1334
4.0     667
3.0     311
1.0     282
2.0     175
dtype: int64

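In [172]:
#Note: this cell scores the balanced-sample variables (y_new_train, y_new_pred_NB) from the Over/Undersampling section, not y_train and y_pred_NB
accuracy_score(y_new_train, y_new_pred_NB)

0.878

The 0.878 above is the Bayes training accuracy on the balanced sample (the same figure appears in the Over/Undersampling section), not the accuracy of the classifier fitted here. The value counts above show this classifier predicting 5 stars for only 1334 of the 2769 training reviews, while 2387 of them are actually 5 star, so its training accuracy is well below that of logistic regression. The predictions are still off, which could again be down to the dataset being weighted towards 5 star ratings.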
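
The class counts printed in this section give a quick sanity check on any accuracy figure for this classifier. A prediction can only be correct if it matches a true label of the same class, so the number of correct predictions for each class is at most the smaller of the predicted and actual counts. A sketch of that arithmetic, using the counts from `series_y_pred_NB.value_counts()` and `y_train.value_counts()` (variable names here are illustrative):

```python
# Counts copied from the notebook outputs.
predicted_counts = {5.0: 1334, 4.0: 667, 3.0: 311, 1.0: 282, 2.0: 175}
actual_counts = {5.0: 2387, 4.0: 218, 3.0: 75, 1.0: 48, 2.0: 41}

n = sum(actual_counts.values())

# For each class, correct predictions cannot exceed min(predicted, actual).
max_correct = sum(min(predicted_counts[c], actual_counts[c]) for c in actual_counts)
accuracy_upper_bound = max_correct / n

print(max_correct, n, round(accuracy_upper_bound, 3))
```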

# Over/Undersampling

Both models' predictions are off. This could be because of an imbalanced dataset. I'll try to balance the dataset with over and undersampling.
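
A minimal sketch of what balancing means here, using `DataFrame.sample` per rating group rather than the manual concatenation used in the cells that follow. The toy frame and `target_per_class` are made up for illustration: classes with more rows than the target are undersampled, and smaller classes are oversampled by drawing rows with replacement.

```python
import pandas as pd

# Made-up toy data with the same kind of imbalance as the review ratings.
toy = pd.DataFrame({
    "rating": [5.0] * 8 + [4.0] * 3 + [1.0] * 2,
    "test": [f"review {i}" for i in range(13)],
})

target_per_class = 4  # hypothetical target; the notebook uses 250

balanced_parts = []
for rating, group in toy.groupby("rating"):
    # Undersample large classes without replacement,
    # oversample small classes by drawing rows with replacement.
    needs_oversampling = len(group) < target_per_class
    balanced_parts.append(
        group.sample(n=target_per_class, replace=needs_oversampling, random_state=0)
    )

toy_balanced = pd.concat(balanced_parts)
print(toy_balanced["rating"].value_counts().sort_index())
```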

In [50]:
#Seeing how many ratings there are for each star
df["rating"].value_counts()

5.0    2980
4.0     270
3.0      92
1.0      66
2.0      54
Name: rating, dtype: int64

In [92]:
#Undersampling the 4 and 5 star ratings to 250 reviews each. There isn't much data for the lower stars, so those are oversampled in the next cells
df_oversampling = pd.concat([df[df["rating"] == 5.0].sample(250), df[df["rating"] == 4.0].sample(250)])
df_oversampling.head()

Unnamed: 0,rating,text_review,no_stopwords,stem_review,test
1774,5.0,Beautifully written. Engaging and moving story.,"[beautifully, written, engaging, moving, story]","[beauti, written, engag, move, stori]",beauti written engag move stori
1731,5.0,The word 'heartbreaking' always puts me off a ...,"[word, heartbreaking, always, puts, book, revi...","[word, heartbreak, alway, put, book, review, d...",word heartbreak alway put book review dont wan...
1400,5.0,The books just flows along brilliantly. Just l...,"[books, flows, along, brilliantly, loved, want...","[book, flow, along, brilliantli, love, want, e...",book flow along brilliantli love want endbeaut...
1458,5.0,This book moved through me as I moved through ...,"[book, moved, moved, marsh, kya, could, hear, ...","[book, move, move, marsh, kya, could, hear, bi...",book move move marsh kya could hear bird feel ...
2820,5.0,Really enjoyed this,"[really, enjoyed]","[realli, enjoy]",realli enjoy


In [90]:
#Oversampling ratings 1-3
rating_3 = pd.concat([df[df["rating"] == 3.0]]*3, ignore_index = True)
rating_2 = pd.concat([df[df["rating"] == 2.0]]*5, ignore_index = True)
rating_1 = pd.concat([df[df["rating"] == 1.0]]*5, ignore_index = True)
rating_3.shape, rating_2.shape, rating_1.shape

((276, 5), (270, 5), (330, 5))

In [93]:
#Adding 250 reviews from each of the oversampled ratings
df_oversampling = pd.concat([df_oversampling, rating_1.sample(250), rating_2.sample(250), rating_3.sample(250)])

In [97]:
df_oversampling.shape

(1250, 5)

In [98]:
df_oversampling.head(2)

Unnamed: 0,rating,text_review,no_stopwords,stem_review,test
1774,5.0,Beautifully written. Engaging and moving story.,"[beautifully, written, engaging, moving, story]","[beauti, written, engag, move, stori]",beauti written engag move stori
1731,5.0,The word 'heartbreaking' always puts me off a ...,"[word, heartbreaking, always, puts, book, revi...","[word, heartbreak, alway, put, book, review, d...",word heartbreak alway put book review dont wan...


In [107]:
#Splitting X and Y data for new samples
X_new_sample = df_oversampling["test"]
y_new_sample = df_oversampling["rating"]

In [108]:
#Tfidf vect variable
X_new_sample = tfidf_vect.fit_transform(X_new_sample)

In [109]:
#Splitting dataset into test and train
X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(X_new_sample, y_new_sample, test_size = 0.2, random_state = 0)

In [110]:
X_new_train.shape, X_new_test.shape

((1000, 3470), (250, 3470))

In [111]:
#Logistic regression on new balanced samples
new_log_model = LogisticRegression()

In [112]:
new_log_model = new_log_model.fit(X=X_new_train, y=y_new_train)

In [113]:
y_new_pred = new_log_model.predict(X_new_train)

In [114]:
#Converting to series to see value counts
series_y_new_pred = pd.Series(y_new_pred)

In [115]:
#Should be roughly 200 of each rating
series_y_new_pred.value_counts()

5.0    218
3.0    203
2.0    201
4.0    190
1.0    188
dtype: int64

In [130]:
#Showing the accuracy score for Logistic Regression with balanced dataset
accuracy_score(y_new_train, y_new_pred)

0.958

In [124]:
#Running the model on the new sampled test data

In [125]:
y_test_model = new_log_model.predict(X_new_test)

In [126]:
series_test = pd.Series(y_new_test)

In [127]:
#Actual ratings in the test split, should be roughly 50 of each
series_test.value_counts()

1.0    55
2.0    52
4.0    51
3.0    49
5.0    43
Name: rating, dtype: int64

In [140]:
#Showing the accuracy score for Logistic Regression test
accuracy_score(y_new_test, y_test_model)

0.992

The model is producing better accuracy after balancing the datasets. I'll now look at Bayes using the same under/oversampled dataset.
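
One caveat when reading the test score: ratings 1 to 3 were duplicated before `train_test_split`, so copies of the same review can land in both the training and test splits. A sketch of one way to avoid that, on made-up toy data with illustrative names, is to split first and oversample only the training rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: unique review texts with an imbalanced rating column.
toy = pd.DataFrame({
    "rating": [5.0] * 12 + [1.0] * 4,
    "test": [f"review {i}" for i in range(16)],
})

# 1. Split first, so every test review is held out before any duplication.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=0, stratify=toy["rating"])

# 2. Oversample the minority class inside the training split only.
minority = toy_train[toy_train["rating"] == 1.0]
majority = toy_train[toy_train["rating"] == 5.0]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
toy_train_balanced = pd.concat([majority, minority_upsampled])

# No review text appears in both the balanced training set and the test set.
shared = set(toy_train_balanced["test"]) & set(toy_test["test"])
print(len(toy_train_balanced), len(toy_test), len(shared))
```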

In [142]:
#Bayes with balanced dataset
new_classifier = GaussianNB()
new_classifier.fit(X_new_train.toarray(), y_new_train)

GaussianNB()

In [155]:
y_new_pred_NB = new_classifier.predict(X_new_train.toarray())

In [156]:
new_series_y_pred_NB = pd.Series(y_new_pred_NB)

In [157]:
new_series_y_pred_NB.value_counts()

1.0    257
2.0    222
5.0    197
3.0    186
4.0    138
dtype: int64

In [158]:
#Bayes accuracy for the training model
accuracy_score(y_new_train, y_new_pred_NB)

0.878

In [150]:
#Bayes fitted on the test split (a separate classifier from new_classifier)
test_classifier = GaussianNB()
test_classifier.fit(X_new_test.toarray(), y_new_test)

GaussianNB()

In [162]:
y_test_pred_NB = test_classifier.predict(X_new_test.toarray())

In [163]:
test_series_y_NB = pd.Series(y_test_pred_NB)

In [164]:
test_series_y_NB.value_counts()

1.0    56
2.0    52
4.0    50
5.0    47
3.0    45
dtype: int64

In [165]:
#Accuracy of test_classifier on the same test split it was fitted on
accuracy_score(y_new_test, y_test_pred_NB)

0.972

Bayes also produces more accurate results after balancing the dataset.
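
Because `test_classifier` is fitted and scored on the same rows, a held-out estimate needs the classifier fitted on the training split instead. A sketch of that version, assuming made-up toy arrays as stand-ins for the notebook's data, fits on the training rows only and scores the test rows, with a confusion matrix to show per-class behaviour:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Made-up dense features and ratings standing in for the notebook's
# X_new_train.toarray(), y_new_train, X_new_test.toarray(), y_new_test.
rng = np.random.default_rng(0)
toy_X_train = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(5.0, 1.0, (20, 3))])
toy_y_train = np.array([1.0] * 20 + [5.0] * 20)
toy_X_test = np.vstack([rng.normal(0.0, 1.0, (5, 3)), rng.normal(5.0, 1.0, (5, 3))])
toy_y_test = np.array([1.0] * 5 + [5.0] * 5)

# Fit on the training split only.
held_out_classifier = GaussianNB()
held_out_classifier.fit(toy_X_train, toy_y_train)

# Score on the test split, which the classifier has not seen.
toy_test_pred = held_out_classifier.predict(toy_X_test)
held_out_accuracy = accuracy_score(toy_y_test, toy_test_pred)

# Rows are actual ratings, columns are predicted ratings.
matrix = confusion_matrix(toy_y_test, toy_test_pred, labels=[1.0, 5.0])
print(held_out_accuracy, matrix.shape)
```

In the notebook this would correspond to scoring `new_classifier.predict(X_new_test.toarray())` against `y_new_test`.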