# Task 2. Review Sentiment Classification

20210848
Jiaheng Guo

In [49]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

#### 1. Load the data from Task 1 and join each review's title and body

In [50]:
data = pd.read_csv("review.csv", usecols=[1, 2, 3, 4])  # skip column 0, the saved index

# Concatenate each review's title and body text
data["text"] = data["title"] + " " + data["body"]
data.head()

Unnamed: 0,title,body,help_information,star,text
0,The herbs were great...but the cherry tomatoe...,The herb kit that came with my Aerogarden was ...,15 out of 17 users found this review helpful,2-star,The herbs were great...but the cherry tomatoe...
1,Even more useful than regular parchment paper,I originally bought this just because it was c...,19 out of 19 users found this review helpful,5-star,Even more useful than regular parchment paper...
2,Shake it before you bake it,"If you do it in reverse (bake before shaking),...",2 out of 13 users found this review helpful,2-star,Shake it before you bake it If you do it in r...
3,Not what the picture describes,I bought this steak for my father in law for C...,7 out of 14 users found this review helpful,2-star,Not what the picture describes I bought this ...
4,What a ripe off - GIVE ME A BREAK,Sorry but I had these noodles and they are no ...,10 out of 34 users found this review helpful,2-star,What a ripe off - GIVE ME A BREAK Sorry but I...


#### 2. Assign a class label to each review

In [51]:
data["label"] = ""
# 3 stars and below are "negative"; 4 and 5 stars are "positive"
data.loc[data["star"] == "1-star","label"] = "negative"
data.loc[data["star"] == "2-star","label"] = "negative"
data.loc[data["star"] == "3-star","label"] = "negative"
data.loc[data["star"] == "4-star","label"] = "positive"
data.loc[data["star"] == "5-star","label"] = "positive"
data.head()

Unnamed: 0,title,body,help_information,star,text,label
0,The herbs were great...but the cherry tomatoe...,The herb kit that came with my Aerogarden was ...,15 out of 17 users found this review helpful,2-star,The herbs were great...but the cherry tomatoe...,negative
1,Even more useful than regular parchment paper,I originally bought this just because it was c...,19 out of 19 users found this review helpful,5-star,Even more useful than regular parchment paper...,positive
2,Shake it before you bake it,"If you do it in reverse (bake before shaking),...",2 out of 13 users found this review helpful,2-star,Shake it before you bake it If you do it in r...,negative
3,Not what the picture describes,I bought this steak for my father in law for C...,7 out of 14 users found this review helpful,2-star,Not what the picture describes I bought this ...,negative
4,What a ripe off - GIVE ME A BREAK,Sorry but I had these noodles and they are no ...,10 out of 34 users found this review helpful,2-star,What a ripe off - GIVE ME A BREAK Sorry but I...,negative
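
The same labelling rule can also be written as a threshold on the numeric star value instead of one assignment per rating. A minimal sketch, assuming every rating string has the form "N-star" as in the rows shown above; the function name `star_to_label` is hypothetical.

```python
def star_to_label(star):
    """Map a rating such as "2-star" to "negative" (1-3) or "positive" (4-5)."""
    stars = int(star.split("-")[0])
    return "positive" if stars >= 4 else "negative"


for rating in ["1-star", "3-star", "4-star", "5-star"]:
    print(rating, star_to_label(rating))
```

With pandas, a function like this could be applied to the whole column with `data["star"].map(star_to_label)`.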


#### 3. Create a numeric representation of the review

This step represents each text as a vector of numbers. The method used in this assignment is the bag-of-words model.

Text vectorization is done with scikit-learn's CountVectorizer and TfidfTransformer.

Each text is represented by a vector that holds, for every word in the vocabulary, the number of times that word occurs in the document.

In [52]:
vectorizer_1d = CountVectorizer(ngram_range=(1, 1)) # unigrams only ("1d")
vectorizer_2d = CountVectorizer(ngram_range=(1, 2)) # unigrams and bigrams ("2d")
X_1d = vectorizer_1d.fit_transform(data["text"])
X_2d = vectorizer_2d.fit_transform(data["text"])
print(X_1d.shape,X_2d.shape)

(36976, 26527) (36976, 366742)
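
To make the difference between the two shapes concrete, here is a minimal pure-Python sketch of counting unigrams versus unigrams plus bigrams. It tokenizes by lowercasing and splitting on whitespace, which is a simplification and not CountVectorizer's actual tokenizer; the helper name `ngram_counts` and the sample text are hypothetical.

```python
from collections import Counter


def ngram_counts(text, max_n):
    """Count all n-grams of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()  # simplified tokenizer (assumption)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


review = "not good not bad"
unigrams = ngram_counts(review, 1)
uni_and_bigrams = ngram_counts(review, 2)

print(unigrams)         # word counts only
print(uni_and_bigrams)  # word counts plus counts of adjacent word pairs
```

Adding bigrams keeps some local word order ("not good" is distinct from "good"), at the cost of a much larger vocabulary, as the jump from 26527 to 366742 columns shows.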


Weighting each count by term frequency multiplied by inverse document frequency (tf-idf), instead of using the raw word count, can give slightly better results. Some terms appear very frequently in a document without being especially relevant to it, because they are also frequent across the whole collection of documents. For example, a product's name may appear in many reviews of that product.
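
As a rough illustration of the weighting, the sketch below computes a smoothed idf, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for TfidfTransformer with its default `smooth_idf=True`. The toy documents and the helper name `smoothed_idf` are made up for the example, and the L2 normalization that TfidfTransformer also applies by default is left out.

```python
import math


def smoothed_idf(docs):
    """Return {term: ln((1 + n) / (1 + df)) + 1} for a list of token lists."""
    n = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}


# Toy reviews (hypothetical): "aerogarden" appears in every document.
docs = [
    "aerogarden herbs great".split(),
    "aerogarden tomatoes failed".split(),
    "aerogarden great kit".split(),
]
idf = smoothed_idf(docs)

# A term found in every document gets the minimum weight,
# while a term found in only one document gets a larger weight.
print(idf["aerogarden"], idf["great"], idf["tomatoes"])
```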

In [53]:
tf_idf_transformer_1d = TfidfTransformer()
X_tf_idf_1d = tf_idf_transformer_1d.fit_transform(X_1d)
tf_idf_transformer_2d = TfidfTransformer()
X_tf_idf_2d = tf_idf_transformer_2d.fit_transform(X_2d)
print(X_tf_idf_1d.shape,X_tf_idf_2d.shape)

(36976, 26527) (36976, 366742)


#### 4. Fit the SGD classifier and the Gaussian Naive Bayes classifier, print the scores

In [54]:
def fit_and_score(title, clf, X: csr_matrix, Y):
    # Split into training and validation sets (0.75 : 0.25), stratified by label
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, Y, train_size=0.75, stratify=Y, random_state=0
    )
    # At first I wanted to try the Gaussian Naive Bayes classifier on all features,
    # but it does not accept sparse matrices as training data. Converting the data
    # to a dense array failed with an out-of-memory error (18.9G required), so
    # GaussianNB is only run on the unigram ("1d") bag of words. Classifiers that
    # accept sparse matrices are run on both the "1d" and the "2d" bag of words.
    if isinstance(clf, GaussianNB):
        # GaussianNB needs dense input, so convert each split once with .toarray()
        X_train, X_valid = X_train.toarray(), X_valid.toarray()

    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)
    print('{}\nTrain score: {} ; Validation score: {}\n'.format(title, round(train_score, 2), round(valid_score, 2)))
    return clf
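
The memory problem described in the comment can be estimated before calling `.toarray()`. A minimal sketch, assuming a dense float64 array (8 bytes per value) and using the matrix shapes printed earlier; the helper name `dense_gib` is hypothetical, and the estimate covers the full matrix rather than only the 75% training split.

```python
def dense_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Shapes taken from the CountVectorizer output above.
print(round(dense_gib(36976, 26527), 1))   # unigram features
print(round(dense_gib(36976, 366742), 1))  # unigram + bigram features
```

Under these assumptions the unigram matrix needs about 7 GiB as a dense array and the unigram + bigram matrix about 101 GiB. The exact figure in an error message depends on which split and dtype are being converted.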

In [55]:
fit_and_score("SGD Classifier, ngram_range=(1, 1)",SGDClassifier(), X_tf_idf_1d, data["label"].values)
fit_and_score("SGD Classifier, ngram_range=(1, 2)",SGDClassifier(), X_tf_idf_2d, data["label"].values)
fit_and_score("GaussianNB Classifier, ngram_range=(1, 1)",GaussianNB(), X_tf_idf_1d, data["label"].values)

SGD Classifier, ngram_range=(1, 1)
Train score: 0.96 ; Validation score: 0.95

SGD Classifier, ngram_range=(1, 2)
Train score: 1.0 ; Validation score: 0.99

GaussianNB Classifier, ngram_range=(1, 1)
Train score: 0.9 ; Validation score: 0.9



GaussianNB()

#### 5. Report

As can be seen, the best result is Train score: 1.0; Validation score: 0.99, obtained by the SGD classifier on the unigram + bigram bag of words (ngram_range=(1, 2)) with tf-idf.

On the unigram features, the SGD classifier scores higher than the Gaussian Naive Bayes classifier. The Gaussian Naive Bayes classifier also does not accept sparse matrices as input, so processing more data or higher-dimensional features would consume a huge amount of memory, which could easily lead to errors.

Overall, the validation scores for both classifiers are about 0.9 or higher. That's not bad for such simple models.

# Task 3. Review Helpfulness Classification

#### 1. Assign a class label to each review

In [56]:
data["helpfulness"] = ""
# Use regular expressions to extract the number of helpful votes and the total number of votes for each review
data["support"] = data["help_information"].str.extract(r"(\d+)", expand=False).astype(int)
data["total"] = data["help_information"].str.extract(r"[ ](\d+)", expand=False).astype(int)
# If more than half of the voters found a review helpful, label it "helpful"; otherwise label it "unhelpful".
ratio = data["support"] / data["total"]
data.loc[ratio > 0.5, "helpfulness"] = "helpful"
data.loc[ratio <= 0.5, "helpfulness"] = "unhelpful"
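
Both numbers can also be captured with a single pattern that matches the whole "X out of Y" phrase. A minimal sketch using the standard `re` module; the function name `helpfulness_label` is hypothetical, and treating a missing or zero vote count as "unhelpful" is an assumption that the original code does not make.

```python
import re

VOTES = re.compile(r"(\d+) out of (\d+)")


def helpfulness_label(help_information):
    """Return "helpful" if more than half of the voters found the review helpful."""
    match = VOTES.search(help_information)
    if match is None:
        return "unhelpful"  # assumption: no vote information means unhelpful
    support, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return "unhelpful"  # assumption: avoid dividing by zero
    return "helpful" if support / total > 0.5 else "unhelpful"


print(helpfulness_label("15 out of 17 users found this review helpful"))
print(helpfulness_label("7 out of 14 users found this review helpful"))
```

A review with exactly half of the votes, such as 7 out of 14, falls on the "unhelpful" side, matching the `<= 0.5` rule used in the cell.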

In [57]:
data.head()

Unnamed: 0,title,body,help_information,star,text,label,helpfulness,support,total
0,The herbs were great...but the cherry tomatoe...,The herb kit that came with my Aerogarden was ...,15 out of 17 users found this review helpful,2-star,The herbs were great...but the cherry tomatoe...,negative,helpful,15,17
1,Even more useful than regular parchment paper,I originally bought this just because it was c...,19 out of 19 users found this review helpful,5-star,Even more useful than regular parchment paper...,positive,helpful,19,19
2,Shake it before you bake it,"If you do it in reverse (bake before shaking),...",2 out of 13 users found this review helpful,2-star,Shake it before you bake it If you do it in r...,negative,unhelpful,2,13
3,Not what the picture describes,I bought this steak for my father in law for C...,7 out of 14 users found this review helpful,2-star,Not what the picture describes I bought this ...,negative,unhelpful,7,14
4,What a ripe off - GIVE ME A BREAK,Sorry but I had these noodles and they are no ...,10 out of 34 users found this review helpful,2-star,What a ripe off - GIVE ME A BREAK Sorry but I...,negative,unhelpful,10,34


#### 2. Build two different binary classification models (GaussianNB classifier, SGD classifier)

In [58]:
fit_and_score("SGD Classifier, ngram_range=(1, 1)",SGDClassifier(), X_tf_idf_1d, data["helpfulness"].values)
fit_and_score("SGD Classifier, ngram_range=(1, 2)",SGDClassifier(), X_tf_idf_2d, data["helpfulness"].values)
fit_and_score("GaussianNB Classifier, ngram_range=(1, 1)",GaussianNB(), X_tf_idf_1d, data["helpfulness"].values)

SGD Classifier, ngram_range=(1, 1)
Train score: 0.94 ; Validation score: 0.92

SGD Classifier, ngram_range=(1, 2)
Train score: 0.99 ; Validation score: 0.97

GaussianNB Classifier, ngram_range=(1, 1)
Train score: 0.92 ; Validation score: 0.92



GaussianNB()

#### 3. Report

Conclusion:

The SGD classifier with ngram_range=(1, 2) is still the best model (Train score: 0.99; Validation score: 0.97).

The results of the two models differ little from those of Task 2: the SGD classifier with ngram_range=(1, 2) again has the best results. The GaussianNB classifier with ngram_range=(1, 1) has the lowest training score (0.92) and ties the SGD classifier with ngram_range=(1, 1) on validation score (0.92).

Both of these simple models score about 0.9 or higher on helpfulness and sentiment classification, so they could be very effective methods. The SGD classifier, however, supports sparse matrices, which reduces memory usage and makes it more scalable. While running the program, the SGD classifier also felt noticeably faster than the Gaussian Naive Bayes classifier.


Future work:

1. Cross-validation could be used to find better parameters and improve the model.

2. Different kinds of classifiers could be tried.
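
For the first item, one possible starting point is scikit-learn's GridSearchCV over a Pipeline, which also refits the tf-idf weights inside each fold. This is only a sketch: the toy texts, the parameter grid, and the 3-fold setting are illustrative assumptions, not results from this assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), standing in for data["text"] and data["label"].
texts = [
    "great product love it", "works great very happy", "excellent taste",
    "love this tea", "very good quality", "happy with this purchase",
    "terrible waste of money", "very bad taste", "broken on arrival",
    "not good at all", "awful do not buy", "bad quality very unhappy",
]
labels = ["positive"] * 6 + ["negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),            # word counts and tf-idf in one step
    ("clf", SGDClassifier(random_state=0)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "clf__alpha": [1e-4, 1e-3],              # regularization strength
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```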