This notebook reuses most of the procedure from the previous explorations, but this time focuses on comparing scikit-learn models to find the best one.

Repository for this code: https://github.com/TheCDC/CSC413_Midterm_Project

Choosing a model: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

List of linear models: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model


Reference: https://www.cs.ucsb.edu/%7Ewilliam/papers/acl2017.pdf

Similar imports as before.

In [1]:
%matplotlib notebook
import pandas
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import time

In [2]:
truth_text_mapping = {
    'pants-fire': 0,
    'false': 1,
    'barely-true': 2,
    'half-true': 3,
    'mostly-true': 4,
    'true': 5,
}


class Statement:
    def __init__(self, body, speaker, value, context):
        self.body = body
        self.speaker = speaker
        self.value = truth_text_mapping[value]
        self.context = context

    @staticmethod
    def from_row(row):
        return Statement(value=row[1],
                         body=row[2],
                         speaker=row[4],
                         context=row[13])

    def __repr__(self):
        arg_str = str(', '.join(['='.join([i[0], repr(i[1])])
                                 for i in vars(self).items()]))
        return "Statement({})".format(arg_str)

    def __str__(self):
        return repr(self)

    @property
    def features(self):
        return ' '.join([self.speaker, self.context, self.body])


import csv


def load_liar_data(path):
    statements = []
    with open(path) as data_file:
        reader = csv.reader(data_file, delimiter='\t', quotechar='"')
        for row in reader:
            try:
                statements.append(Statement.from_row(row))
            except IndexError:
                print(row, len(row))
    return statements


statements = load_liar_data("../datasets/LIAR/train.tsv")
# Number of statements loaded, as a quick sanity check.
len(statements)

10241

Vectorize the training data via sklearn.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

x = vectorizer.fit_transform([s.features for s in statements])
y = np.array([s.value for s in statements]).ravel()

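# x has one row per statement and one column per vocabulary term.
print('Matrix shape (statements, vocabulary terms):', x.shape)
x

Matrix shape (statements, vocabulary terms): (10241, 14552)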


<10241x14552 sparse matrix of type '<class 'numpy.int64'>'
	with 221431 stored elements in Compressed Sparse Row format>
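
As a small illustration of what the vectorizer produces, here is a minimal sketch on two made-up sentences (the sentences and the names `demo_docs`, `demo_vectorizer`, and `demo_x` are hypothetical, not taken from the LIAR data). Each column counts how often one vocabulary term occurs in a document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents standing in for Statement.features strings.
demo_docs = ["the tax cut passed", "the tax rate rose"]

demo_vectorizer = CountVectorizer()
demo_x = demo_vectorizer.fit_transform(demo_docs)

# Column order follows the alphabetically sorted vocabulary.
terms = sorted(demo_vectorizer.vocabulary_)
print(terms)             # ['cut', 'passed', 'rate', 'rose', 'tax', 'the']
print(demo_x.toarray())  # [[1 1 0 0 1 1], [0 0 1 1 1 1]]
```

Terms that never appear in the training text are silently ignored by `transform`, which is why the test set later has the same number of columns as the training matrix.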

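Fit a ridge regression model and score it against the training data. For regressors, score() returns the coefficient of determination (R^2), not accuracy.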

In [4]:
clf = linear_model.Ridge(fit_intercept=True, alpha=0.01)
clf.fit(x, y)
clf.score(x, y)

0.73765150487092312

An R^2 of only about 0.74 on the same data the model was trained on is not promising. Now score it on the held-out test set.

In [5]:
test_statements = load_liar_data("../datasets/LIAR/test.tsv")
x_test = vectorizer.transform([s.features for s in test_statements])
y_test = np.array([s.value for s in test_statements]).ravel()
clf.score(x_test, y_test)

-0.18428915852079597
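
A negative score is possible because `score()` on a regressor returns R^2, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes that quantity by hand on a toy label vector (the helper `r_squared` and the toy arrays are illustrative, not part of the notebook): always predicting the mean gives exactly 0, and anything worse than that goes negative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


toy_labels = np.array([0, 1, 2, 3, 4, 5])

# Always predicting the mean label scores exactly zero.
mean_score = r_squared(toy_labels, np.full(6, toy_labels.mean()))

# Predicting the labels in reverse order is worse than the mean.
reversed_score = r_squared(toy_labels, toy_labels[::-1])

print(mean_score, reversed_score)  # 0.0 -3.0
```

Read this way, the test score above means the fitted model predicts the test labels worse than a constant guess at their mean would.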

In [6]:
# Print predicted vs. actual values for the first 20 test statements.
list(zip(clf.predict(x_test[:20]), y_test[:20]))

[(1.6981627780554429, 5),
 (2.3943054810660889, 1),
 (1.4846024236565638, 1),
 (1.5923527449103883, 3),
 (0.61512542788298097, 0),
 (3.1396738979990655, 5),
 (3.7472879880233894, 5),
 (0.35950924016343855, 2),
 (2.6628539526986645, 5),
 (4.4398964812644159, 2),
 (2.6073627088634286, 2),
 (0.0023495192461071568, 2),
 (0.19147823712872247, 0),
 (1.5131455488610017, 1),
 (2.0087359387436363, 3),
 (2.5033620682511883, 5),
 (1.8920736856669986, 0),
 (2.5360836997720897, 3),
 (2.9556110125594217, 5),
 (2.469653323641547, 1)]
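
One way to put these regression outputs on the same footing as a classifier is to round each prediction to the nearest label and clip it to the valid range 0 to 5. The sketch below does this for the 20 pairs listed above (values copied and rounded to four decimal places; the helper `to_class` is a hypothetical name, not something the notebook defines).

```python
import numpy as np

# (prediction, actual) pairs copied from the listing above, rounded to 4 d.p.
pairs = [
    (1.6982, 5), (2.3943, 1), (1.4846, 1), (1.5924, 3), (0.6151, 0),
    (3.1397, 5), (3.7473, 5), (0.3595, 2), (2.6629, 5), (4.4399, 2),
    (2.6074, 2), (0.0023, 2), (0.1915, 0), (1.5131, 1), (2.0087, 3),
    (2.5034, 5), (1.8921, 0), (2.5361, 3), (2.9556, 5), (2.4697, 1),
]
predicted = np.array([p for p, _ in pairs])
actual = np.array([a for _, a in pairs])


def to_class(values, lowest=0, highest=5):
    """Round to the nearest integer label and clip to the label range."""
    return np.clip(np.rint(values), lowest, highest).astype(int)


accuracy = np.mean(to_class(predicted) == actual)
print(accuracy)  # 0.15
```

Only three of these 20 rounded predictions match the true label. Twenty rows are far too few to draw conclusions from, but the same conversion could be applied to the full test set to obtain an accuracy figure.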

In [7]:
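# Search for the Ridge alpha that gives the best R^2 on the test set.
# Caveat: choosing alpha on the test set makes the reported score optimistic.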
clf = linear_model.Ridge(fit_intercept=True)
best = None
for al in np.logspace(-1, 1, 10):
    clf.set_params(alpha=al)
    clf.fit(x, y)
    s = clf.score(x_test, y_test)
    if best is None or s > best[1]:
        best = (al, s)
    print(al, s)
print(best)

0.1 -0.168200480777
0.16681005372 -0.160992548303
0.278255940221 -0.150582729315
0.464158883361 -0.135034903273
0.774263682681 -0.111858085519
1.29154966501 -0.0813207911703
2.15443469003 -0.0452453448098
3.5938136638 -0.00510536231343
5.99484250319 0.0327295465867
10.0 0.0655838577779
(10.0, 0.065583857777861798)
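
The best alpha here sits at the upper edge of the grid, and it was chosen by looking at the test set. A more conservative approach, sketched below, is to pick alpha by cross-validation on the training data only and over a wider grid. This is a sketch on synthetic data, not a rerun of the experiment: `x_demo`, `y_demo`, and the grid bounds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the count matrix and labels (hypothetical data).
rng = np.random.RandomState(0)
x_demo = rng.poisson(1.0, size=(200, 30)).astype(float)
true_w = rng.normal(size=30)
y_demo = x_demo @ true_w + rng.normal(scale=5.0, size=200)

# Wider grid than logspace(-1, 1), so the optimum is less likely to
# land on the boundary.
alphas = np.logspace(-1, 3, 9)

# 5-fold cross-validation on the training data; the test set is untouched.
search = GridSearchCV(Ridge(fit_intercept=True), {"alpha": alphas},
                      cv=5, scoring="r2")
search.fit(x_demo, y_demo)

print(search.best_params_["alpha"], search.best_score_)
```

With the real data, `x` and `y` from the training split would take the place of `x_demo` and `y_demo`, and `x_test` would be scored once at the end using `search.best_estimator_`.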


In [8]:
clf.set_params(alpha=best[0])
clf.fit(x, y)

Ridge(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [9]:
clf.score(x_test, y_test)

0.06545819913249773

In [10]:
# Print predicted vs. actual values for the first 20 test statements.
# Both sides are sliced the same way so that each prediction is paired
# with the label of the same statement.
list(zip(clf.predict(x_test[:20]), y_test[:20]))

[(1.7594321883696065, 5),
 (2.5446481776419683, 1),
 (1.6250822235700111, 1),
 (1.7606811898223107, 3),
 (1.0883171284360318, 0),
 (2.7344527249075918, 5),
 (3.0576007305170365, 5),
 (1.1372971751590499, 2),
 (2.4557353260277579, 5),
 (3.8825840750344978, 2),
 (2.8292044155289315, 2),
 (1.362745192491009, 2),
 (0.62440738983604582, 0),
 (2.0212172305660516, 1),
 (2.479063328146335, 3),
 (2.4937448739382062, 5),
 (1.82199953410074, 0),
 (2.1844813724470371, 3),
 (3.0034227534368925, 5),
 (2.9471367988463322, 1)]

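Quantify the performance of other models. Note that for the classifiers below, score() returns mean accuracy, whereas for Ridge and Lasso it returns R^2, so the two kinds of score are not directly comparable.
Flowchart for choosing models: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html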

In [11]:
from sklearn.svm import LinearSVC
model = LinearSVC
clf = model()
clf.fit(x, y)
clf.score(x_test, y_test)

0.22415153906866614

In [12]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
clf = GaussianNB()
clf.fit(x.toarray(), y)
clf.score(x_test.toarray(), y_test)

0.19021310181531176

In [13]:
clf = MultinomialNB()
clf.fit(x, y)
clf.score(x_test, y_test)

0.26045777426992894

In [14]:
clf = BernoulliNB()
clf.fit(x, y)
clf.score(x_test, y_test)

0.27703235990528807

In [15]:
clf = linear_model.Lasso(alpha=0.1)
clf.fit(x, y)
clf.score(x_test, y_test)

0.0012527445600031273
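
The classifier cells above all follow the same fit-then-score pattern, so they can be run in one loop alongside a majority-class baseline for context. The sketch below uses random stand-in data (the arrays `x_demo` and `y_demo` and the 200/100 split are assumptions for illustration), so the printed accuracies say nothing about LIAR itself.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical stand-in data: 300 "documents", 50 "terms", six classes.
rng = np.random.RandomState(0)
x_demo = rng.poisson(1.0, size=(300, 50))
y_demo = rng.randint(0, 6, size=300)
x_tr, x_te = x_demo[:200], x_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

candidates = {
    # Always predicts the most common training label.
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

# For classifiers, score() is mean accuracy on the given data.
scores = {}
for name, candidate in candidates.items():
    candidate.fit(x_tr, y_tr)
    scores[name] = candidate.score(x_te, y_te)

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<18} {:.3f}".format(name, acc))
```

Replacing the stand-in arrays with `x`, `y`, `x_test`, and `y_test` would reproduce the comparison above in a single cell and show how far each model sits above the baseline.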

# Conclusion

The paper reports accuracy similar to my experiments here, at about 22% to 27%. For comparison, uniform random guessing over the six labels would score about 17%. These results suggest that the core assumption, that the truthfulness of a statement can be inferred from its vocabulary and structure, is flawed. It is possible that a model that actually understands English and manners of speech could get better results.