## Who Wrote "The Fatal Conceit"? A Simple Machine Learning Approach

I analyzed the writing styles of Hayek and his editor, William Warren Bartley, using a bag-of-words model of their commonly used words, to assess which of them is more likely to be the author of "The Fatal Conceit".

> There is a scholarly debate on how much influence William Warren Bartley had on "The Fatal Conceit", a work by Nobel Prize laureate F.A. Hayek. Officially, Bartley was the editor who prepared the book for publication once Hayek fell ill in 1985. However, the inclusion of material from Bartley's philosophical point of view and citations that other people provided to Bartley have led to questions about how much of the book was written by Hayek and whether Hayek knew about the added material. Bruce Caldwell thinks the evidence "clearly points towards a conclusion that the book was a product more of [Bartley's] pen than of Hayek's. ... Bartley may have written the book".


### Import libraries

In [None]:
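from __future__ import division  # __future__ imports must come first

import glob
import pickle
import random
import string
from collections import Counter

try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3

import nltk
import numpy as np
from nltk.corpus import names
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # needed for the train-test split below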

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

### Create corpus

In [None]:
## I selected 6 books by Hayek and 9 books by his editor on similar topics and of similar length. The aim is to compare the writing style and commonly used words of both authors, so I neither lemmatized the words nor eliminated traditional stop words.


## I excluded a few custom stop words that could cause overfitting or that come from formatting artifacts (e.g., "individualism" and "order" appear on every page of one of Hayek's books).
stop_words = ['individualism','economic','economics','order','hayek','bartley','popper',"''","``"]

def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

import io

def get_words(f):
    # io.open decodes the file as UTF-8 on both Python 2 and Python 3
    with io.open(f, 'r', encoding='utf8') as fh:
        words = nltk.word_tokenize(fh.read())
    useful_words = []
    for word in words:
        # Lowercase first so the stop-word check also catches capitalized tokens such as "Hayek"
        word = word.lower()
        if (word not in string.punctuation and word not in stop_words
                and not any(c.isdigit() for c in word)
                and '.' not in word and '-' not in word):
            useful_words.append(word)
    return set(useful_words)


word_corpus = set()

word_corpus_hayek = set()
for i in range(4):
    word_corpus_hayek = word_corpus_hayek.union(get_words('books/hayek/book{}.txt'.format(i)))

word_corpus_editor = set()
editor_files = glob.glob('books/editor/*.txt')
for f in editor_files:
    word_corpus_editor = word_corpus_editor.union(get_words(f))

word_corpus = word_corpus_hayek.union(word_corpus_editor)
word_corpus = list(word_corpus)
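
The `grouper` helper defined above is what later cuts each book into fixed-size blocks. A small sketch of how it behaves, assuming Python 3 (where `izip_longest` is named `zip_longest`) and using a toy list of strings in place of the lines of a book:

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    # The list holds n references to the SAME iterator, so zip_longest pulls
    # n consecutive items into each tuple and pads the last one with fillvalue.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Toy stand-in for the lines of a book (hypothetical data).
lines = ["line 1", "line 2", "line 3", "line 4", "line 5"]
blocks = list(grouper(2, lines, fillvalue=""))
print(blocks)
# [('line 1', 'line 2'), ('line 3', 'line 4'), ('line 5', '')]
```

The last block is padded with empty strings, so a book whose length is not a multiple of the block size still yields one final, shorter observation.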

### Create Featureset (Note: this may take quite a long time)

In [None]:
## Split each book into blocks of 500 lines. Each block is one observation.

n = 500

import io

def create_featureset(book, author):
    featureset = []
    # io.open decodes the file as UTF-8 on both Python 2 and Python 3
    with io.open(book, 'r', encoding='utf8') as f:
        for chapter in grouper(n, f, fillvalue=''):
            current_words = []
            for sector in chapter:
                for word in nltk.word_tokenize(sector):
                    # word_corpus is lowercase, so lowercase before filtering and lookup
                    word = word.lower()
                    if (word not in string.punctuation and word not in stop_words
                            and not any(c.isdigit() for c in word)
                            and '.' not in word and '-' not in word):
                        current_words.append(word)

            # Count how often each corpus word occurs in this block
            feature = np.zeros(len(word_corpus))
            for word in current_words:
                if word in word_corpus:
                    index_value = word_corpus.index(word)
                    feature[index_value] += 1

            featureset.append([feature, author])
    return featureset


# Label an observation as 1 if the author is Hayek and as 0 if the author is his editor.

featureset = []

for i in range(4):
    featureset += create_featureset('books/hayek/book{}.txt'.format(i),1)

for f in editor_files:
    featureset += create_featureset(f,0)

random.shuffle(featureset)

X = np.array([featureset[i][0] for i in range(len(featureset))])
y = np.array([featureset[i][1] for i in range(len(featureset))])

## I got 182 observations and over 30000 unique words as features (X.shape is (182, 33239)).

## Train-test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
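
Much of the running time mentioned in the heading of this section probably comes from `word in word_corpus` and `word_corpus.index(word)`, which both scan the whole list for every token. A minimal sketch of the same counting step with a dictionary that maps each word to its column, assuming Python 3 and a toy vocabulary; `build_index` and `count_vector` are hypothetical names, not part of the notebook:

```python
def build_index(vocabulary):
    # Map each word to its column position once, so later lookups are O(1).
    return {word: i for i, word in enumerate(vocabulary)}


def count_vector(tokens, index):
    # Bag-of-words counts for one block of text; unknown words are skipped.
    counts = [0] * len(index)
    for token in tokens:
        column = index.get(token)
        if column is not None:
            counts[column] += 1
    return counts


# Toy vocabulary and block of tokens (hypothetical data).
vocabulary = ["market", "knowledge", "reason", "tradition"]
index = build_index(vocabulary)
vector = count_vector(["reason", "market", "reason", "socialism"], index)
print(vector)  # [1, 0, 2, 0]
```

The resulting counts are the same as with the list-based lookup; only the cost per token changes.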

### Train and Cross-validation

In [None]:
## I chose XGBClassifier (XGBoost), an ensemble machine learning algorithm, to train a model and validate the result. The result looks good on the original set but does not hold up as well on the validation sets (I validated it on other books by Hayek, and the result is not very clear).

clf = XGBClassifier()
clf.fit(X_train, y_train)
# result:
# XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
#        gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
#        min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
#        objective='binary:logistic', reg_alpha=0, reg_lambda=1,
#        scale_pos_weight=1, seed=0, silent=True, subsample=1)

y_pred = clf.predict(X_test)

In [None]:
## Good results on both the test set and cross-validation.

np.mean(y_pred == y_test)
### result: 0.97297297297297303

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator= clf, X = X, y=y, cv = 10)
print(accuracies.mean())
print(accuracies.std())

### result: 0.983625730994, 0.0250235004624

### Test on "The Fatal Conceit"

In [None]:
## I applied the XGBC model to "The Fatal Conceit". The result suggests that the book was written not by Hayek but by his editor: only 4 of the 15 pieces of the book were predicted to be Hayek's writing.

feature_test = create_featureset('books/fatal-conceit.txt',1)
with open('featureset_fatal.pickle', 'wb') as save_f:
    pickle.dump(feature_test, save_f)

X_val = np.array([feature_test[i][0] for i in range(len(feature_test))])
y_val = np.ones(len(feature_test))

y_pred = clf.predict(X_val)

In [None]:
y_pred
### result: array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0])

np.mean(y_pred == y_val)
### result: 0.26666666666666666

clf.predict_proba(X_val)
# result:
# array([[ 0.90069646,  0.09930352],
#        [ 0.80879188,  0.19120814],
#        [ 0.96110272,  0.03889726],
#        [ 0.61629498,  0.38370502],
#        [ 0.92610949,  0.07389053],
#        [ 0.93491316,  0.06508683],
#        [ 0.82616472,  0.17383531],
#        [ 0.94665325,  0.05334673],
#        [ 0.01235527,  0.98764473],
#        [ 0.00783384,  0.99216616],
#        [ 0.95124978,  0.04875023],
#        [ 0.97885853,  0.02114148],
#        [ 0.0117451 ,  0.9882549 ],
#        [ 0.01979834,  0.98020166],
#        [ 0.93806189,  0.06193811]], dtype=float32)
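
The per-piece probabilities above can be condensed into two book-level numbers: the share of pieces whose Hayek probability exceeds 0.5, and the mean Hayek probability. A small sketch in plain Python 3, with the second column of the `predict_proba` output copied by hand. Treating the mean as a book-level score is an assumption of this sketch, not something the notebook does:

```python
# P(Hayek) for the 15 pieces of "The Fatal Conceit", copied from the
# second column of the predict_proba output above.
p_hayek = [
    0.09930352, 0.19120814, 0.03889726, 0.38370502, 0.07389053,
    0.06508683, 0.17383531, 0.05334673, 0.98764473, 0.99216616,
    0.04875023, 0.02114148, 0.9882549, 0.98020166, 0.06193811,
]

hayek_votes = sum(p > 0.5 for p in p_hayek)  # hard vote per piece
vote_share = hayek_votes / len(p_hayek)      # share of pieces labeled Hayek
mean_p = sum(p_hayek) / len(p_hayek)         # averaged probability

print(hayek_votes, len(p_hayek))  # 4 15
print(round(vote_share, 4))       # 0.2667
print(round(mean_p, 2))           # 0.34
```

Both numbers point the same way as the hard predictions: about a quarter of the pieces, and roughly a third of the averaged probability, go to Hayek.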

However, when I validated the model on other books that were certainly written by Hayek, the results were mixed. The books I chose are The Intellectuals and Socialism and The Counter-Revolution of Science. The model is correct on The Intellectuals and Socialism but performs poorly on The Counter-Revolution of Science.

### Validation on The Intellectuals and Socialism

In [None]:
feature_test_1 = create_featureset('books/intellectual.txt',1)

X_val_1 = np.array([feature_test_1[i][0] for i in range(len(feature_test_1))])
y_val_1 = np.ones(len(feature_test_1))

y_pred_1 = clf.predict(X_val_1); y_pred_1
### result: array([1, 1])

clf.predict_proba(X_val_1)
# result:
# array([[ 0.0049696 ,  0.9950304 ],
#        [ 0.01759064,  0.98240936]], dtype=float32)

### Validation on The Counter-Revolution of Science

In [None]:
feature_test_2 = create_featureset('books/counter.txt',1)

X_val_2 = np.array([feature_test_2[i][0] for i in range(len(feature_test_2))]) 
y_val_2 = np.ones(len(feature_test_2))
y_pred_2 = clf.predict(X_val_2)

y_pred_2
# result:
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])

np.mean(y_pred_2 == y_val_2) 
# result: 0.660377358490566 

### Different algorithm?
So I tried other machine learning algorithms: SVC, a decision tree, and multinomial naive Bayes.

### SVC

In [None]:
clf_svc = SVC()
clf_svc.fit(X_train, y_train)

clf_svc.predict(X_val)
# array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

## SVC also predicts that the editor is more likely to be the author, but its result is again inconsistent on The Counter-Revolution of Science.

accuracies = cross_val_score(estimator= clf_svc, X = X, y=y, cv = 10)
print(accuracies.mean())
print(accuracies.std())
# 0.951169590643
# 0.056804219319

clf_svc.predict(X_val_1)
# array([1, 1])

clf_svc.predict(X_val_2) 
# array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

### Decision tree

In [None]:
clf_tree= DecisionTreeClassifier(criterion='entropy')
clf_tree.fit(X_train,y_train)
# result:
# DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
#             max_features=None, max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             presort=False, random_state=None, splitter='best')

accuracies = cross_val_score(estimator= clf_tree, X = X, y=y, cv = 10)
print(accuracies.mean())
print(accuracies.std())
# 0.972807017544
# 0.035979172754

clf_tree.predict(X_val)

clf_tree.predict(X_val_1)

clf_tree.predict(X_val_2)

## The outputs saved with these calls were produced by clf_svc.predict and repeat the SVC arrays above, so the tree's predictions on the three books still need to be run.

### Multinomial Bayes

In [None]:
clf_Multi = MultinomialNB()
clf_Multi.fit(X_train, y_train)
clf_Multi.predict(X_val)
# array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

## Multinomial naive Bayes attributes 14 of the 15 pieces to Hayek.

accuracies = cross_val_score(estimator= clf_Multi, X = X, y=y, cv = 10)
print(accuracies.mean())
print(accuracies.std())
# 0.994736842105
# 0.0157894736842

clf_Multi.predict(X_val_1)
# array([1, 1])

clf_Multi.predict(X_val_2)
# array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 0, 1])

The multinomial naive Bayes classifier predicts that Hayek is very likely to be the author, and its predictions are consistent on both validation books.
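
To compare the three sets of predictions on "The Fatal Conceit" side by side, the share of pieces attributed to Hayek can be tallied from the arrays printed above. A small sketch in plain Python 3; the arrays are copied by hand from the outputs, and the dictionary keys are just labels chosen for this sketch:

```python
# Predictions for the 15 pieces of "The Fatal Conceit" (1 = Hayek, 0 = editor),
# copied from the outputs above.
predictions = {
    "XGBClassifier": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    "SVC":           [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "MultinomialNB": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Fraction of pieces each model labels as Hayek.
hayek_share = {name: sum(p) / len(p) for name, p in predictions.items()}

for name, share in hayek_share.items():
    print("{}: {:.0%} of pieces attributed to Hayek".format(name, share))
```

XGBClassifier and SVC attribute most pieces to the editor, while multinomial naive Bayes attributes almost all of them to Hayek, which is why the conclusion below depends on which model is trusted.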

### Conditional Conclusion: The best-validated algorithm indicates that Hayek is more likely to be the author.

### Notes:

However, this result is not very robust, and the approach could be improved in several ways.

1. The bag-of-words method alone is not enough to build a complete feature set; word order, punctuation, and other features, as well as better algorithms, should be considered.

2. The text files are not completely clean; some contain formatting artifacts.

3. The available text by Hayek's editor is proportionally small, and his subjects lean more toward philosophy and religion, although I tried to match the texts of the two authors as closely as possible. The editor's text is generally more similar to the test features from "The Fatal Conceit".

4. More validation sets should be tested. 

5. Since W.W. Bartley was indeed the editor of this book, he had at least some influence over its writing style.
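
Note 1 suggests features beyond single-word counts. As a rough illustration of what that could look like, the sketch below counts word bigrams (which keep some local word order) and punctuation marks for a piece of text. It is plain Python 3 with a simple regular-expression tokenizer instead of `nltk.word_tokenize`, and `bigram_counts` and `punctuation_counts` are hypothetical helper names:

```python
import re
import string
from collections import Counter


def bigram_counts(text):
    # Lowercased word tokens; pairs of adjacent words keep local word order.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def punctuation_counts(text):
    # How often each ASCII punctuation mark occurs in the text.
    return Counter(ch for ch in text if ch in string.punctuation)


# Made-up sample sentence (hypothetical data).
sample = "The market order is not designed; the market order emerges."
bigrams = bigram_counts(sample)
marks = punctuation_counts(sample)

print(bigrams[("the", "market")])  # 2
print(marks[";"], marks["."])      # 1 1
```

Counts like these could be appended to the word-count vector of each block, at the cost of a much larger and sparser feature space.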