# MANIC's entry to FNC-1

This notebook walks through the model we entered in FNC-1. Read more about the competition at fakenewschallenge.org.

This model placed 10th of 50 entrants with a score of 9243/11650.5.

First, let's set the detail of our model, which is the number of trees (n_estimators) in each random forest. For our final submission we used n_estimators=500.

In [1]:
detail = 500

Then we load our training data ("train_bodies.csv" and "train_stances.csv" are located in res/fnc/fnc-1)

In [2]:
from res.fnc.utils.dataset import DataSet
d = DataSet(path="res/fnc/fnc-1/")

Reading dataset
Total stances: 49972
Total bodies: 1683


Next we split our data into a training set (~80%) and a holdout set (~20%). The training set is then split into 10 folds, each holding ~8% of the full data.

In [3]:
from res.fnc.utils.generate_test_splits import kfold_split, get_stances_for_folds

folds,holdout = kfold_split(d, n_folds=10)
fold_stances, hold_out_stances = get_stances_for_folds(d, folds, holdout)
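
As a rough illustration of the proportions described above, the sketch below splits a list of IDs into an ~80% training portion cut into 10 folds plus a ~20% holdout. It is not the code in generate_test_splits: the helper name `split_ids`, the seed, and the shuffle-then-slice strategy are all assumptions made for this example.

```python
import random

def split_ids(ids, train_fraction=0.8, n_folds=10, seed=0):
    """Hypothetical helper: shuffle IDs, hold out a tail, cut the rest into folds."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)        # deterministic shuffle for repeatability
    cut = int(len(ids) * train_fraction)    # boundary between training and holdout
    training, holdout_ids = ids[:cut], ids[cut:]
    # Striding by n_folds gives folds whose sizes differ by at most one
    fold_ids = [training[i::n_folds] for i in range(n_folds)]
    return fold_ids, holdout_ids

# 1683 is the number of bodies reported when the dataset was loaded
folds_demo, holdout_demo = split_ids(range(1683))
print(len(holdout_demo), [len(f) for f in folds_demo])
```

The demo variables carry a `_demo` suffix so they do not overwrite `folds` and `holdout` from the cell above.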

Now we generate the features of our data and construct a feature vector for each headline-body pair. The feature vector has length 152.

In [4]:
from res.fnc.fnc_kfold import generate_features
Xs = dict()
ys = dict()

for fold in fold_stances:
    Xs[fold], ys[fold] = generate_features(fold_stances[fold], d, str(fold))
    print('Features finished for fold ', fold)
X_holdout, y_holdout = generate_features(hold_out_stances, d, "holdout")
print('Features finished for holdout set')
print(len(X_holdout[0]))  # length of one feature vector
#print(X_holdout[0])

Features finished for fold  0
Features finished for fold  1
Features finished for fold  2
Features finished for fold  3
Features finished for fold  4
Features finished for fold  5
Features finished for fold  6
Features finished for fold  7
Features finished for fold  8
Features finished for fold  9
Features finished for holdout set
152


During development we used the 10 folds for cross-validation, but we're done with development.

Now let's smush all the folds together into one training set and check its score on the holdout set.

First we merge the folds into a single feature array and a single label array

In [5]:
import numpy as np

X_arrays = [np.array(Xs[fold]) for fold in Xs]
y_arrays = [np.array(ys[fold]) for fold in ys]
X_train = np.concatenate(X_arrays)
y_train = np.concatenate(y_arrays)

Now we train a Random Forest Classifier to separate "Unrelated" pairs from "Discuss/Disagree/Agree" pairs

In [6]:
from sklearn.ensemble import RandomForestClassifier

y_relation_train = [0 if a < 3 else 3 for a in y_train]
relationModel = RandomForestClassifier(n_estimators=detail, random_state=42)
relationModel.fit(X_train, y_relation_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

Let's see how the relation sort performs on the holdout set

In [7]:
y_relation_holdout = [0 if y < 3 else 3 for y in y_holdout]
score = relationModel.score(X_holdout, y_relation_holdout)
print("Relation Sort Accuracy: %0.3f" % score)

Relation Sort Accuracy: 0.976


Now we train a Random Forest Classifier to separate "Discuss" pairs from "Disagree/Agree" pairs, using only the related pairs

In [8]:
X_neu_train = [x for i, x in enumerate(X_train) if y_train[i] != 3]
y_neu_train = [0 if y < 2 else 2 for y in y_train if y != 3]
neutralModel = RandomForestClassifier(n_estimators=detail, random_state=42)
neutralModel.fit(X_neu_train, y_neu_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

Let's see how the neutrality sort performs on the holdout set

In [9]:
X_neu_holdout = [x for i, x in enumerate(X_holdout) if y_holdout[i] != 3]
y_neu_holdout = [0 if y < 2 else 2 for y in y_holdout if y != 3]
score = neutralModel.score(X_neu_holdout, y_neu_holdout)
print("Neutrality Sort Accuracy: %0.3f" % score)

Neutrality Sort Accuracy: 0.780


Now we train a Random Forest Classifier to separate "Disagree" pairs from "Agree" pairs, using only the pairs with one of those two labels

In [10]:
X_val_train = [x for i, x in enumerate(X_train) if y_train[i] < 2]
y_val_train = [y for y in y_train if y < 2]
valenceModel = RandomForestClassifier(n_estimators=detail, random_state=42)
valenceModel.fit(X_val_train, y_val_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

Let's see how the valence sort performs on the holdout set

In [11]:
X_val_holdout = [x for i, x in enumerate(X_holdout) if y_holdout[i] < 2]
y_val_holdout = [y for y in y_holdout if y < 2]
score = valenceModel.score(X_val_holdout, y_val_holdout)
print("Valence Sort Accuracy: %0.3f" % score)

Valence Sort Accuracy: 0.840
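
The three classifiers above each see a different slice of the data, with labels collapsed to two values. The sketch below restates that collapsing on a toy label list. It assumes the integer encoding 0 = agree, 1 = disagree, 2 = discuss, 3 = unrelated, which is what the comparisons in the cells above imply; the helper name `stage_targets` is made up for this example.

```python
def stage_targets(labels):
    """Hypothetical helper: build the binary targets for each stage."""
    # Stage 1 (relation): every sample; related (0, 1, 2) -> 0, unrelated -> 3
    relation = [0 if y < 3 else 3 for y in labels]
    # Stage 2 (neutrality): related samples only; agree/disagree -> 0, discuss -> 2
    neutrality = [0 if y < 2 else 2 for y in labels if y != 3]
    # Stage 3 (valence): agree/disagree samples only; labels kept as 0 or 1
    valence = [y for y in labels if y < 2]
    return relation, neutrality, valence

toy_labels = [0, 1, 2, 3, 3, 2, 0]
toy_rel, toy_neu, toy_val = stage_targets(toy_labels)
print(toy_rel)  # [0, 0, 0, 3, 3, 0, 0]
print(toy_neu)  # [0, 0, 2, 2, 0]
print(toy_val)  # [0, 1, 0]
```

Each later stage trains on fewer samples than the one before it, which is worth keeping in mind when comparing the three accuracy figures.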


Now let's combine all three models together and see how we did!

In [12]:
from res.fnc.utils.score import report_score, LABELS, score_submission

pred_rel = relationModel.predict(X_holdout)
pred_neu = neutralModel.predict(X_holdout)
pred_val = valenceModel.predict(X_holdout)
# Cascade: unrelated first, then discuss, then disagree (1) or agree (0)
pred = [3 if r == 3 else (2 if n == 2 else (1 if v == 1 else 0))
        for r, n, v in zip(pred_rel, pred_neu, pred_val)]

predicted = [LABELS[int(a)] for a in pred]
actual = [LABELS[int(a)] for a in y_holdout]
report_score(actual,predicted)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    384    |     1     |    339    |    38     |
-------------------------------------------------------------
| disagree  |    71     |    15     |    66     |    10     |
-------------------------------------------------------------
|  discuss  |    129    |    15     |   1548    |    108    |
-------------------------------------------------------------
| unrelated |    10     |     0     |    64     |   6824    |
-------------------------------------------------------------
Score: 3808.25 out of 4448.5	(85.6075081488142%)


85.6075081488142
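
The conditional that builds `pred` in the cell above reads more easily as a three-stage cascade. The function below is an equivalent restatement for a single sample; `combine` is a name introduced here for illustration and is not part of the original code.

```python
def combine(rel, neu, val):
    """Merge the three per-sample stage predictions into one 4-class label."""
    if rel == 3:      # relation model says unrelated: stop here
        return 3
    if neu == 2:      # neutrality model says discuss
        return 2
    return 1 if val == 1 else 0   # valence model picks disagree (1) or agree (0)

# Every model predicts on every sample; a later stage only matters
# when the earlier stages let the sample through.
stage_preds = [(3, 2, 1), (0, 2, 1), (0, 0, 1), (0, 0, 0)]
print([combine(r, n, v) for r, n, v in stage_preds])  # [3, 2, 1, 0]
```

One consequence of this design is that an error in the relation stage cannot be undone later: a related pair predicted as unrelated never reaches the other two models.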

Alright, fast-forward to when the FNC-1 organizers release the test data set.

This time we train on ALL of the training data (the 10 folds plus the holdout set), and we'll see how we do predicting the official test set.

In [13]:
from sklearn.ensemble import RandomForestClassifier

X_arrays = [np.array(Xs[fold]) for fold in Xs]
X_arrays.append(X_holdout)
y_arrays = [np.array(ys[fold]) for fold in ys]
y_arrays.append(y_holdout)
X_train = np.concatenate(X_arrays)
y_train = np.concatenate(y_arrays)

y_rel_train = [0 if y < 3 else 3 for y in y_train]

X_neu_train = [x for i, x in enumerate(X_train) if y_train[i] < 3]
y_neu_train = [0 if y < 2 else 2 for y in y_train if y < 3]

X_val_train = [x for i, x in enumerate(X_train) if y_train[i] < 2]
y_val_train = [y for y in y_train if y < 2]

relationModel = RandomForestClassifier(n_estimators=detail, random_state=42)
neutralModel = RandomForestClassifier(n_estimators=detail, random_state=42)
valenceModel = RandomForestClassifier(n_estimators=detail, random_state=42)

In [14]:
relationModel.fit(X_train, y_rel_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

In [15]:
neutralModel.fit(X_neu_train, y_neu_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

In [16]:
valenceModel.fit(X_val_train, y_val_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

Here we load the official test datasets ("test_bodies.csv" and "test_stances.csv" are in res/fnc/eval)

In [17]:
test_d = DataSet(path="./res/fnc/eval")

Reading dataset
Total stances: 25413
Total bodies: 904


Now we generate the feature vectors for the headline-body pairs in the test data

In [18]:
X_test, y_test = generate_features(test_d.stances, test_d, "finaltrial")

Now let's see our final competition score!

In [19]:
pred_rel = relationModel.predict(X_test)
pred_neu = neutralModel.predict(X_test)
pred_val = valenceModel.predict(X_test)

# Cascade: unrelated first, then discuss, then disagree (1) or agree (0)
pred = [3 if r == 3 else (2 if n == 2 else (1 if v == 1 else 0))
        for r, n, v in zip(pred_rel, pred_neu, pred_val)]

predicted = [LABELS[int(a)] for a in pred]
actual = [LABELS[int(a)] for a in y_test]
report_score(actual,predicted)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    502    |     1     |   1230    |    170    |
-------------------------------------------------------------
| disagree  |    157    |     0     |    382    |    158    |
-------------------------------------------------------------
|  discuss  |    358    |     2     |   3717    |    387    |
-------------------------------------------------------------
| unrelated |    32     |     0     |    300    |   18017   |
-------------------------------------------------------------
Score: 9255.75 out of 11651.25	(79.43997425168973%)


79.43997425168973
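
The reported score can be reproduced from the confusion matrix alone. The sketch below assumes the published FNC-1 weighting (0.25 for getting related versus unrelated right, plus 0.75 more when a related pair also gets the exact stance) and assumes rows are actual labels and columns are predictions. `fnc_score` is a name introduced here; it is not the `report_score` implementation.

```python
def fnc_score(cm):
    """Score a 4x4 confusion matrix ordered agree, disagree, discuss, unrelated.

    Assumes rows are actual labels and columns are predicted labels.
    """
    score = 0.0
    best = 0.0
    for actual, row in enumerate(cm):
        for predicted, count in enumerate(row):
            if actual == 3:
                # Unrelated pairs are worth 0.25, earned only if predicted unrelated
                best += 0.25 * count
                if predicted == 3:
                    score += 0.25 * count
            else:
                # Related pairs are worth 1.0: 0.25 for "related", 0.75 for the stance
                best += 1.0 * count
                if predicted != 3:
                    score += 0.25 * count
                if predicted == actual:
                    score += 0.75 * count
    return score, best

test_cm = [
    [502, 1, 1230, 170],
    [157, 0, 382, 158],
    [358, 2, 3717, 387],
    [32, 0, 300, 18017],
]
print(fnc_score(test_cm))  # (9255.75, 11651.25)
```

Read this way, the matrix shows where the points went: no actual "disagree" pair was predicted as "disagree", and most "agree" pairs landed in "discuss", which earns only the 0.25 partial credit.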