# Experiment 3: More Data

The previous experiment tried to achieve high accuracy by using better features. Here we keep the features that worked best and increase the size of the dataset to investigate the impact. The previous experiment attempted to replicate the results of a paper, but that paper did not seem to use all of the data available to it. Here we use all of the available data while keeping our two classes balanced. The resulting dataset is about 10x bigger than in the last experiment.

In [1]:
from protos import review_set_pb2, review_pb2
review_set = review_set_pb2.ReviewSet()
with open("data/yelpZip", 'rb') as f:
  review_set.ParseFromString(f.read())
print(len(review_set.reviews))

608598


Now we split our dataset into fake and genuine, but also store the reviews that we didn't use. We use all of the fake reviews in the dataset, so the only unused reviews are genuine.

In [2]:
from sklearn.utils import shuffle

fake_reviews = list(filter(lambda x: x.label, review_set.reviews))
count_fake = len(fake_reviews)
genuine_reviews = []
unused_genuine_reviews = []
counter_genuine = 0
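# Walk the reviews in random order so the genuine sample is not biased by file order.
for review in shuffle(review_set.reviews):
  if review.label:
    continue  # fake reviews were already collected above
  # Note: `<=` admits one more genuine review than there are fake ones
  # (see the counts printed below); use `<` for an exact balance.
  if counter_genuine <= count_fake:
    genuine_reviews.append(review)
    counter_genuine += 1
  else:
    unused_genuine_reviews.append(review)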
  
concatted_reviews = fake_reviews + genuine_reviews
print("fake:", len(fake_reviews))
print("real:", len(genuine_reviews))
print("all:", len(concatted_reviews))
print("unused real:", len(unused_genuine_reviews))

fake: 80466
real: 80467
all: 160933
unused real: 447665
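
The counts above show one more genuine review than fake, because the loop compares with `<=`. As a minimal sketch of an exactly balanced split, the same idea can be written with a shuffle and a slice. The helper name `balanced_split`, the `seed` parameter, and the toy dictionaries standing in for reviews are all assumptions made for illustration; they are not part of the project code.

```python
import random

def balanced_split(items, is_positive, seed=0):
    """Return (positives, sampled_negatives, unused_negatives).

    Keeps every positive item and draws an equally sized random
    sample of negatives; the remaining negatives are returned too.
    """
    positives = [x for x in items if is_positive(x)]
    negatives = [x for x in items if not is_positive(x)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(negatives)
    k = min(len(positives), len(negatives))
    return positives, negatives[:k], negatives[k:]

# Toy data standing in for reviews: every fifth item is labelled fake.
toy_reviews = [{"id": i, "label": i % 5 == 0} for i in range(50)]
fake, genuine, unused = balanced_split(toy_reviews, lambda r: r["label"])
print("fake:", len(fake))
print("real:", len(genuine))
print("unused real:", len(unused))
```

Seeding the generator keeps the sampled subset stable between runs, which makes later cross-validation scores easier to compare.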


In [1]:
from scripts.feature_extraction import get_balanced_dataset
concatted_reviews = get_balanced_dataset()

In [2]:
targets = [x.label for x in concatted_reviews]

In [4]:
from exp4_data_feature_extraction import get_features_maker
get_features = get_features_maker(concatted_reviews)

In [9]:
scoring = {
    'acc': 'accuracy',
    'auroc': 'roc_auc',
    'f1': 'f1'
}

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

predictor_features = get_features(concatted_reviews)
naive_bayes = MultinomialNB()
fold = 10
results = cross_validate(naive_bayes, predictor_features, targets, cv=fold, scoring=scoring, return_train_score=False)
# print("Average accuracy:", results['test_acc'].mean())

In [14]:
results

{'fit_time': array([0.46036673, 0.45923424, 0.45915699, 0.45723844, 0.36908245,
        0.37357187, 0.37405682, 0.37362266, 0.37323928, 0.37408853]),
 'score_time': array([0.04839873, 0.04749274, 0.04627275, 0.04383707, 0.03976035,
        0.04009771, 0.04002523, 0.03997731, 0.03998733, 0.04001665]),
 'test_acc': array([0.68787492, 0.69463944, 0.69272495, 0.6854499 , 0.68672623,
        0.6936822 , 0.69332397, 0.68387797, 0.69377074, 0.68713301]),
 'test_auroc': array([0.75767943, 0.76391493, 0.76046234, 0.75584793, 0.75551675,
        0.75623508, 0.75845146, 0.7520655 , 0.76099029, 0.7549497 ]),
 'test_f1': array([0.70892103, 0.71566938, 0.71409061, 0.7076339 , 0.70528907,
        0.71139971, 0.7143112 , 0.70624518, 0.71358644, 0.70583293])}

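The accuracy is slightly better with the larger dataset: the mean across the ten folds is about 0.69, with AUROC about 0.76 and F1 about 0.71. This suggests that the last experiment already trained quite well on the smaller dataset, or at least that a 10x increase in data does not improve the model dramatically.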
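
The per-fold arrays above are easier to compare with other experiments once each metric is reduced to a mean and a spread. The sketch below does this with the standard library only. The helper name `summarise_cv` is hypothetical, and the accuracy values are copied from the `test_acc` output above; in the notebook itself the `results` dictionary returned by `cross_validate` could be passed in directly.

```python
from statistics import mean, stdev

def summarise_cv(cv_results):
    """Map each 'test_*' entry to (mean, sample standard deviation)."""
    summary = {}
    for name, scores in cv_results.items():
        if not name.startswith("test_"):
            continue  # skip timing entries such as fit_time and score_time
        values = [float(s) for s in scores]  # also accepts NumPy arrays
        summary[name] = (mean(values), stdev(values))
    return summary

# Per-fold accuracies copied from the cross-validation output above.
cv_results = {
    "test_acc": [0.68787492, 0.69463944, 0.69272495, 0.6854499, 0.68672623,
                 0.6936822, 0.69332397, 0.68387797, 0.69377074, 0.68713301],
}

summary = summarise_cv(cv_results)
for metric, (avg, spread) in summary.items():
    print(f"{metric}: {avg:.4f} +/- {spread:.4f}")
```

Reporting the spread alongside the mean helps judge whether a small gain over the previous experiment is larger than the fold-to-fold variation.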