# Experiment 4: Latest features on all statistical classifiers

After experimenting with different features that work for Naive Bayes, we will try the same features with other classifiers to see how they perform. We could probably adapt the features to suit each classifier better, but as an initial step we will look at how the classifiers compare to Naive Bayes using similar features.

In [1]:
from latest_feature_extraction import get_balanced_dataset

reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()
print("fake:", len(fake_reviews))
print("real:", len(genuine_reviews))
print("all:", len(reviews_set))
print("unused real:", len(unused_genuine_reviews))

fake: 80466
real: 80467
all: 160933
unused real: 447665


In [2]:
targets = [x.label for x in reviews_set]

## Linear Discriminant Analysis

The first classifier we will try is Linear Discriminant Analysis (LDA). Like Naive Bayes, LDA can handle unscaled features, so we don't need to do any feature scaling yet.

In [3]:
from latest_feature_extraction import get_features_maker

We can use the latest feature extraction for this comparison. This uses the features that performed well with bag of words.

In [5]:
get_features = get_features_maker(reviews_set, 775)
predictor_features = get_features(reviews_set)

Here we can only make our bag-of-words features so big before we crash with a MemoryError.

In [6]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_validate
import numpy
linearDA = LinearDiscriminantAnalysis()
results = cross_validate(linearDA, predictor_features.toarray(), numpy.asarray(targets), cv=2,
                         return_train_score=False)



In [7]:
results['test_score'].mean()

0.69606875

The result is impressive! Even though the features were adapted to Naive Bayes, LDA achieves a higher accuracy with them than Naive Bayes did. The warning 'UserWarning: Variables are collinear' indicates that some features are so strongly correlated that their individual effects cannot be told apart. Since we will likely explore new features in the future, this warning will probably go away once the features are replaced.
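To illustrate what collinearity means, here is a minimal sketch on a made-up matrix (hypothetical values, not the review features). When one column is an exact linear combination of the others, the rank of the matrix is lower than its number of columns. This is a related check, not necessarily the exact test scikit-learn runs before emitting the warning.

```python
import numpy as np

# Toy feature matrix (hypothetical data, not the review features).
# The third column is exactly the sum of the first two, so it carries
# no information of its own.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 3.0],
    [1.0, 3.0, 4.0],
])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)

# A rank below the column count means at least one feature is redundant.
has_collinear_columns = rank < n_columns
print("columns:", n_columns, "rank:", rank, "collinear:", has_collinear_columns)
```

Running the same rank check on a sample of the real features might help identify how many of them are redundant.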

## Feature scaling

For the rest of the classifiers used here we standardise all our features so that each one has zero mean and unit variance. We need to do this to stop features with large values from dominating features with small values. A classifier that relies on Euclidean distance has no knowledge of the units being used, which is why we need to standardise the features.
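A minimal sketch of why this matters for distance-based classifiers, using made-up values rather than the review features. `StandardScaler`, the scaler used in this notebook, rescales each column to zero mean and unit variance. The `distance` helper is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 is a large count, column 1 is a small ratio.
X = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],
    [2000.0, 0.1],
])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Before scaling, the large-valued column decides almost everything:
# rows 0 and 1 look close although their ratios differ a lot.
raw_01 = distance(X[0], X[1])
raw_02 = distance(X[0], X[2])

# After scaling, both columns contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = distance(X_scaled[0], X_scaled[1])
scaled_02 = distance(X_scaled[0], X_scaled[2])

print("raw distances:   ", raw_01, raw_02)
print("scaled distances:", scaled_01, scaled_02)
```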

Our sparse features are problematic for feature scaling. We need to convert our data to an array for standardisation. Since we cannot do this with the sparse data without hitting MemoryErrors, we will use dense bag-of-words features from experiment 2 instead.

In [8]:
from latest_feature_extraction import dense_features_maker
from exp2_feature_extraction import find_words

corpus_words = [find_words(x.review_content) for x in reviews_set]

In [9]:
dense_features_getter = dense_features_maker(corpus_words, 75)

In [10]:
predictor_features_dense = [dense_features_getter(x) for x in corpus_words]

`predictor_features_dense` can be scaled, since we can now convert it to an array that fits in memory.

In [11]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(predictor_features_dense)

Note that because we go straight to cross-validation, the scaler is fitted on all of the data, including the samples that each fold later uses for testing.
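One way to keep the test folds out of the scaler is to put the scaler and the classifier into a scikit-learn pipeline, so that `cross_validate` refits the scaler on the training part of every fold. The sketch below uses synthetic stand-in data (an assumption, not the review features) and logistic regression as an example estimator; any other classifier could take its place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

# The scaler is refitted on the training part of every fold, so the
# held-out part never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline_results = cross_validate(pipeline, X, y, cv=2, return_train_score=False)

print(pipeline_results['test_score'].mean())
```

With a dataset this large the difference in accuracy is likely to be small, but the pipeline version gives a cleaner estimate.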

In [12]:
scaled_features = scaler.transform(predictor_features_dense)

## Support Vector Machine

In [13]:
from sklearn.svm import LinearSVC
svc = LinearSVC(max_iter=10000)

In [16]:
cross_validate(svc, scaled_features, targets, cv=2, return_train_score=False)



{'fit_time': array([442.23583102, 492.59881949]),
 'score_time': array([0.02300429, 0.02286482]),
 'test_score': array([0.62902805, 0.63109885])}

The results are not as accurate as those of LDA. It is not a fair comparison, because we had to use different, dense features here, and these features would also perform worse with LDA. This is the best we can do for now.

The 'ConvergenceWarning: Liblinear failed to converge, increase the number of iterations' message may be caused by features that do not help the solver converge. Increasing `max_iter` does not appear to solve this.
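One option that might be worth trying is `dual=False`, which the scikit-learn documentation recommends for `LinearSVC` when there are more samples than features. The sketch below uses synthetic stand-in data (an assumption, not the review features) and records any convergence warnings raised during fitting. Whether this removes the warning on the real features is untested here.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic stand-in data with many more samples than features.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_clusters_per_class=1, random_state=0)

# dual=False solves the primal problem instead of the dual one.
svc_primal = LinearSVC(dual=False, max_iter=10000)

# Record warnings so we can check whether the solver complained.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    svc_primal.fit(X, y)

convergence_warnings = [w for w in caught
                        if issubclass(w.category, ConvergenceWarning)]
print("convergence warnings:", len(convergence_warnings))
print("training accuracy:", svc_primal.score(X, y))
```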

## Logistic regression

In [17]:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression()

In [18]:
cross_validate(logistic_regression, scaled_features, targets, cv=2, return_train_score=False)



{'fit_time': array([0.80156898, 0.82275629]),
 'score_time': array([0.02250504, 0.02260971]),
 'test_score': array([0.62996011, 0.63176994])}

Logistic regression shows very similar results to the SVM, but it runs much faster: fitting takes under a second per fold, compared with several minutes for the SVM.

## K Nearest Neighbors

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier()

In [20]:
cross_validate(knn_classifier, scaled_features, targets, cv=2, return_train_score=False)

{'fit_time': array([20.18374848, 21.10131526]),
 'score_time': array([1139.47850466, 1165.53416753]),
 'test_score': array([0.59302571, 0.58950364])}

These classifiers showed a range of accuracies, from about 0.59 for K Nearest Neighbors to about 0.70 for LDA. It would be interesting to see how adapting the features to these classifiers would affect the accuracy. For now LDA appears to be the most accurate, which is in line with other research done on this dataset.
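The comparison can be summarised by averaging the per-fold test scores reported in the outputs above. The sketch below copies those numbers by hand; LDA appears only as its two-fold mean, since that is the only value shown for it.

```python
from statistics import mean

# Per-fold test scores copied from the cross-validation outputs above.
# LDA is listed as its mean over the two folds.
fold_scores = {
    "LDA": [0.69606875],
    "Linear SVM": [0.62902805, 0.63109885],
    "Logistic regression": [0.62996011, 0.63176994],
    "K Nearest Neighbors": [0.59302571, 0.58950364],
}

mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}

# Sort classifier names from highest to lowest mean accuracy.
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {mean_scores[name]:.4f}")
```

Keep in mind that LDA used the sparse features while the other three used the dense features, so the ranking compares classifier and feature set together.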