# Experiment 3: More Data

The previous experiment tried to achieve high accuracy by using better features, and in doing so attempted to replicate the results of a paper. That paper, however, did not seem to use all of the data available to it. Here we keep the features that worked best and increase the size of our dataset to investigate the impact. We use all of the available data while maintaining the balance of our two classes, which makes the dataset about 10x bigger than in the last experiment.

In [1]:
from protos import review_set_pb2, review_pb2
review_set = review_set_pb2.ReviewSet()
with open("data/yelpZip", 'rb') as f:
  review_set.ParseFromString(f.read())
print(len(review_set.reviews))

608598


In [2]:
from sklearn.utils import shuffle

fake_reviews = list(filter(lambda x: x.label, review_set.reviews))
counter_fake = len(fake_reviews)
genuine_reviews = []
unused_genuine_reviews = []
counter_genuine = 0
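for review in shuffle(review_set.reviews):
  if review.label:
    continue  # fake reviews were already collected above
  # Note: `<=` admits one more genuine review than there are fake reviews
  # (80467 vs. 80466 in the output below); `<` would balance them exactly.
  if counter_genuine <= counter_fake:
    genuine_reviews.append(review)
    counter_genuine += 1
  else:
    unused_genuine_reviews.append(review)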
  
concatted_reviews = fake_reviews + genuine_reviews
print("fake:", len(fake_reviews))
print("real:", len(genuine_reviews))
print("all:", len(concatted_reviews))
print("unused real:", len(unused_genuine_reviews))

fake: 80466
real: 80467
all: 160933
unused real: 447665
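
The balancing step above can be sketched in isolation. This is a minimal sketch on toy data, not the notebook's code: `balance_by_undersampling` and the `(text, label)` tuples are hypothetical stand-ins for the protobuf reviews, and it keeps exactly as many negatives as positives.

```python
import random

def balance_by_undersampling(items, is_positive, seed=0):
    """Keep every positive item and an equally sized random sample of negatives.

    Hypothetical helper for illustration; returns (positives, kept, unused).
    """
    positives = [x for x in items if is_positive(x)]
    negatives = [x for x in items if not is_positive(x)]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(negatives)
    n = len(positives)
    return positives, negatives[:n], negatives[n:]

# Toy data: (text, label) pairs where label True means "fake".
toy = [("r%d" % i, i % 5 == 0) for i in range(50)]
fake, genuine, unused = balance_by_undersampling(toy, lambda r: r[1])
print("fake:", len(fake), "real:", len(genuine), "unused real:", len(unused))
```

Fixing the random seed is a design choice for this sketch: it makes the set of unused genuine reviews reproducible between runs.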


In [3]:
targets = [x.label for x in concatted_reviews]

In [4]:
from scipy.sparse import coo_matrix, hstack  # used by get_features below
from sklearn.feature_extraction.text import CountVectorizer
from exp2_feature_extraction import reviews_by_reviewer
from exp2_feature_extraction import reviewer_features

corpus = [x.review_content for x in concatted_reviews]
unigram_count_vect = CountVectorizer()
unigram_count_vect.fit(corpus)

def get_features(reviews):
  reviews_corpus = [x.review_content for x in reviews]
  features_ngram_bow = unigram_count_vect.transform(reviews_corpus)

  reviews_reviewer_map = reviews_by_reviewer(reviews)
  features_reviewer = [reviewer_features(x.user_id, reviews_reviewer_map) for x in reviews]

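  # Repeat the reviewer feature block four times, presumably so that it is
  # not drowned out by the much wider bag-of-words block.
  features = [features_reviewer for _ in range(4)]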
  features.append(features_ngram_bow)

  return hstack([coo_matrix(x) for x in features])
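
To make the shape of the combined matrix concrete, here is a small sketch of the same stacking pattern on made-up numbers. The arrays `bow` and `reviewer` are invented for illustration and say nothing about what `reviewer_features` actually returns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Made-up data: 2 reviews, 3 bag-of-words columns, 2 reviewer columns.
bow = csr_matrix(np.array([[1, 0, 2],
                           [0, 3, 0]]))
reviewer = np.array([[5.0, 1.0],
                     [2.0, 0.0]])

repeats = 4  # the reviewer block appears four times, as in get_features
blocks = [csr_matrix(reviewer)] * repeats + [bow]
stacked = hstack(blocks).tocsr()

print(stacked.shape)         # rows stay the same, columns add up
print(stacked.toarray()[0])  # first review: reviewer block x4, then word counts
```

The row count is unchanged and the column count is `repeats * 2 + 3`, so every review still maps to exactly one row of the feature matrix.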

In [5]:
from scipy.sparse import coo_matrix, hstack
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

predictor_features = get_features(concatted_reviews)
naive_bayes = MultinomialNB()
fold = 5
results = cross_validate(naive_bayes, predictor_features, targets, cv=fold, return_train_score=False)
print("Average accuracy:", sum(results['test_score']) / fold)

Average accuracy: 0.6736405194019781


The accuracy is slightly better with the larger dataset. This suggests that the model in the last experiment had already learned most of what it could from the smaller dataset, or at least that a 10x increase in data does not improve the model dramatically.

Now, because we have some spare data, let's see how many of the unused genuine reviews our model can correctly label as genuine:

In [6]:
naive_bayes.fit(predictor_features, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [7]:
unused_features = get_features(unused_genuine_reviews[:100000])
results = naive_bayes.predict(unused_features)

In [8]:
genuine_results = [x for x in results if x == False]
print(len(genuine_results), "of", len(results))

60188 of 100000


It appears that our classifier likes to classify things as fake: it labels only about 60% of an all-genuine set correctly, which is below its cross-validated accuracy of about 67%. We don't have any unused fake reviews, but we can run the classifier over the balanced set again. The model has just been fitted on this data, so the results are not very meaningful.
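
The reasoning behind this comparison can be made explicit with a little arithmetic. On a balanced set, accuracy is the mean of the two per-class recalls, so a genuine recall below the overall accuracy implies a fake recall above it. The sketch below assumes the recall measured on the unused genuine reviews carries over to the cross-validation folds, which is only an approximation.

```python
# Numbers taken from the outputs above.
cv_accuracy = 0.6736405194019781   # average cross-validated accuracy
genuine_recall = 60188 / 100000    # unused genuine reviews labelled genuine

# On a balanced set: accuracy = (fake_recall + genuine_recall) / 2.
# Solving for fake_recall gives an implied estimate, not a measurement.
implied_fake_recall = 2 * cv_accuracy - genuine_recall

print("genuine recall:", round(genuine_recall, 3))
print("implied fake recall:", round(implied_fake_recall, 3))
```

Under that assumption the classifier would recognise roughly three in four fake reviews but only about three in five genuine ones.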

In [9]:
used_fake_features = get_features(fake_reviews)
used_fake_results = naive_bayes.predict(used_fake_features)

used_fake_correct = len([x for x in used_fake_results if x == True])
used_fake_total = len(used_fake_results)
print("Fake set:", used_fake_correct, "of", used_fake_total, "=", used_fake_correct/used_fake_total)

Fake set: 64485 of 80466 = 0.8013943777496085


In [10]:
used_genuine_features = get_features(genuine_reviews)
used_genuine_results = naive_bayes.predict(used_genuine_features)

used_genuine_correct = len([x for x in used_genuine_results if x == False])
used_genuine_total = len(used_genuine_results)
print("Genuine set:", used_genuine_correct, "of", used_genuine_total, "=", used_genuine_correct/used_genuine_total)

Genuine set: 50456 of 80467 = 0.627039656008053


## Conclusion

These accuracies are measured on data the model was fitted on, so we can't trust them. Still, they agree with the result on the unused genuine reviews: the classifier is biased towards labelling an arbitrary review as fake. About 80% of the fake reviews are labelled correctly, against about 63% of the genuine ones. Since we have already balanced our dataset, we can perhaps increase the accuracy of our classifier by steering it to remove this bias.
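
One possible way to steer the classifier, shown here only as a sketch on synthetic data, is to raise the probability threshold at which a review is called fake. The data, the helper `predict_with_threshold`, and the candidate thresholds are all assumptions made for illustration; this notebook does not try the approach, and in practice the threshold should be chosen on held-out data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic count features for two balanced classes (not the review data).
X_fake = rng.poisson(lam=[3, 1, 1], size=(200, 3))
X_genuine = rng.poisson(lam=[1, 2, 3], size=(200, 3))
X = np.vstack([X_fake, X_genuine])
y = np.array([True] * 200 + [False] * 200)  # True means "fake"

model = MultinomialNB().fit(X, y)
fake_col = list(model.classes_).index(True)  # column of P(fake) in predict_proba
p_fake = model.predict_proba(X)[:, fake_col]

def predict_with_threshold(p_fake, threshold):
    """Hypothetical helper: label a review fake only if P(fake) >= threshold."""
    return p_fake >= threshold

for threshold in (0.5, 0.6, 0.7):
    pred = predict_with_threshold(p_fake, threshold)
    fake_recall = pred[y].mean()         # fake reviews labelled fake
    genuine_recall = (~pred[~y]).mean()  # genuine reviews labelled genuine
    print(threshold, round(float(fake_recall), 3), round(float(genuine_recall), 3))
```

Raising the threshold can only move predictions from fake to genuine, so it trades fake recall for genuine recall. Whether that improves overall accuracy depends on the data.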