This notebook takes the Federalist Papers and tries to build a machine learning model that predicts the author of each paper.

In [1]:
# 1. Read in the csv file using pandas.
# Convert the author column to categorical data.
# Display the first few rows.
# Display the counts by author.

import pandas as pd # Load the Pandas libraries with alias 'pd'
df = pd.read_csv("federalist.csv")
df['author'] = pd.Categorical(df.author)

In [2]:
# Preview the first 5 lines of the loaded data
print(df.head())
print("\n")
print(df.author.value_counts())

     author                                               text
0  HAMILTON  FEDERALIST. No. 1 General Introduction For the...
1       JAY  FEDERALIST No. 2 Concerning Dangers from Forei...
2       JAY  FEDERALIST No. 3 The Same Subject Continued (C...
3       JAY  FEDERALIST No. 4 The Same Subject Continued (C...
4       JAY  FEDERALIST No. 5 The Same Subject Continued (C...


HAMILTON                49
MADISON                 15
HAMILTON OR MADISON     11
JAY                      5
HAMILTON AND MADISON     3
Name: author, dtype: int64


In [3]:
# 2. Divide into train and test, with 80% in train. Use random state 1234.
# Display the shape of train and test.

# divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.text, df.author, test_size=0.2, random_state = 1234)

In [4]:
# Display the shape of train and test.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(66,) (17,) (66,) (17,)


In [5]:
# 3. Process the text by removing stop words and performing tf-idf vectorization
# Output the training set shape and the test set shape.

# removing stop words and performing tf-idf vectorization
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
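# Keep the stop words as a list: newer scikit-learn releases expect a list (not a set) for stop_words.
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to the word list.
stopwords = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stopwords)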

# vectorize
X_train_fit = vectorizer.fit_transform(X_train) # returns document term matrix
X_test_fit = vectorizer.transform(X_test)

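# Note: these are the shapes of the raw text Series; X_train_fit.shape and X_test_fit.shape
# give the shapes of the tf-idf document-term matrices.
print("Train and test sizes (shapes): ", X_train.shape, X_test.shape)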
print("peek the data:\n", X_train_fit.toarray(), '\n\n', X_test_fit.toarray())
# print("peek the data:\n", X_train,'\n\n', X_test)
# print("peek the labels:\n", y_train,'\n\n', y_test)

Train and test sizes (shapes):  (66,) (17,)
peek the data:
 [[0.         0.         0.02956872 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.02275824 0.         0.        ]] 

 [[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.02314673 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cfran\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
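# 4. Bernoulli Naïve Bayes model. What is your accuracy on the test set?
# Note: this cell fits MultinomialNB rather than BernoulliNB, so the accuracy below is for the multinomial model.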
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_fit, y_train)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# evaluate on the test data
# make predictions on the test data
pred = naive_bayes.predict(X_test_fit)

print('accuracy score: ', accuracy_score(y_test, pred))

accuracy score:  0.5882352941176471


This accuracy is low: 0.588 corresponds to 10 of the 17 test papers. Let's see whether the vectorization is the problem.
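
To put that figure in context, it can help to compare it with a majority-class baseline. The sketch below uses only the author counts printed earlier; the variable names are illustrative, and the share is computed over all 83 papers rather than the 17-paper test set, whose exact class mix is not shown in this notebook.

```python
# Author counts as printed by df.author.value_counts() above.
author_counts = {
    "HAMILTON": 49,
    "MADISON": 15,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "HAMILTON AND MADISON": 3,
}

total = sum(author_counts.values())
majority_author = max(author_counts, key=author_counts.get)
majority_share = author_counts[majority_author] / total

# A classifier that always predicts the most common author would be
# right about this often on the full corpus.
print(majority_author, round(majority_share, 3))
```

Hamilton accounts for roughly 59% of the corpus, so an accuracy near 0.59 is about what always guessing the most common author would achieve on the full data.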

In [7]:
# 5. Redo the vectorization with max_features option set to use only the 1000 most frequent words.
# In addition to the words, add bigrams as a feature.
# Try Naïve Bayes again on the new train/test vectors and compare your results.

# new vectorization
vectorizer2 = TfidfVectorizer(stop_words=stopwords, max_features=1000, ngram_range=(1, 2))
X_train_fit2 = vectorizer2.fit_transform(X_train) # returns document term matrix
X_test_fit2 = vectorizer2.transform(X_test)

naive_bayes.fit(X_train_fit2, y_train)
pred = naive_bayes.predict(X_test_fit2)

print('accuracy score: ', accuracy_score(y_test, pred))

accuracy score:  0.5882352941176471


It looks like the accuracy did not change: limiting the vocabulary to 1,000 features and adding bigrams left the test accuracy at 0.588. Let's now look at a different algorithm: logistic regression.
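
One way to see what `max_features` and `ngram_range` do is to fit the vectorizer on a tiny corpus and inspect the resulting vocabulary. This is a minimal sketch on a made-up three-document corpus (not the Federalist data), with a cap of 5 features chosen purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the Federalist Papers data).
toy_docs = [
    "the people of the state",
    "the power of the union",
    "the people and the union",
]

# Unigrams only, no cap on vocabulary size.
unigram_vec = TfidfVectorizer()
unigram_vec.fit(toy_docs)

# Unigrams plus bigrams, keeping only the 5 most frequent terms.
capped_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5)
capped_vec.fit(toy_docs)

print(len(unigram_vec.vocabulary_), sorted(unigram_vec.vocabulary_))
print(len(capped_vec.vocabulary_), sorted(capped_vec.vocabulary_))
```

With `ngram_range=(1, 2)` the candidate vocabulary grows to include word pairs, and `max_features` then keeps only the terms with the highest corpus frequency, so rare but distinctive terms can be dropped.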

In [8]:
# 6. Try logistic regression. Adjust at least one parameter in the LogisticRegression() model
# to see if you can improve results over having no parameters.
# What are your results?
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train_fit2, y_train)
pred = classifier.predict(X_test_fit2)
print('accuracy score without params:\t', accuracy_score(y_test, pred))

# Change parameters 
classifier = LogisticRegression(multi_class='multinomial', solver='lbfgs', class_weight='balanced')
classifier.fit(X_train_fit2, y_train)
pred = classifier.predict(X_test_fit2)
print('accuracy score using params:\t', accuracy_score(y_test, pred))

accuracy score without params:	 0.5882352941176471
accuracy score using params:	 0.7647058823529411


Accuracy improves from 0.588 to 0.765 (13 of 17 test papers) once the logistic regression parameters are adjusted. However, we may be able to do better. Can a neural network outperform this?
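
Among the adjusted parameters, `class_weight='balanced'` weights each class inversely to its frequency, using n_samples / (n_classes * class_count). The sketch below shows what those weights look like for the author counts printed earlier. It assumes the full 83-paper counts for illustration; the actual weights used in the cell above come from the 66-paper training split, whose class mix is not shown.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Author counts from the value_counts() output above (full data set,
# used here as a stand-in for the training split).
author_counts = {
    "HAMILTON": 49,
    "MADISON": 15,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "HAMILTON AND MADISON": 3,
}

# Expand the counts into one label per paper.
labels = np.array([a for a, n in author_counts.items() for _ in range(n)])
classes = np.unique(labels)

# 'balanced' computes n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes, weights))

for author, weight in weight_by_class.items():
    print(f"{author:22s}{weight:.3f}")
```

Rare labels receive much larger weights than HAMILTON, so misclassifying them costs more during training.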

In [9]:
# 7. Neural Network
from sklearn.neural_network import MLPClassifier
# start with a straightforward design: two hidden layers of 15 and 2 nodes
classifier = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15, 2), random_state=1234)
classifier.fit(X_train_fit2, y_train)

pred = classifier.predict(X_test_fit2)
print('15, 2 accuracy:\t', accuracy_score(y_test, pred))

15, 2 accuracy:	 0.5882352941176471


The accuracy of this basic network is not great (0.588, the same as Naïve Bayes). Let's adjust the architecture and see whether changing the number of layers and the number of nodes per layer helps.

In [10]:
# try other topologies, first 1 layer
import warnings
warnings.filterwarnings("ignore")  # silence all warnings (including convergence warnings) during the sweep

one_layers = []
for i in range(1, 66):
    # (i,) is a one-element tuple: a single hidden layer with i nodes
    classifier = MLPClassifier(solver='lbfgs', max_iter=500, alpha=1e-5, hidden_layer_sizes=(i,), random_state=1234)
    classifier.fit(X_train_fit2, y_train)
    pred = classifier.predict(X_test_fit2)
    one_layers.append((i, accuracy_score(y_test, pred)))
    #print(i, 'accuracy:\t', accuracy_score(y_test, pred))
# report the (size, accuracy) pair with the highest test accuracy
print(max(one_layers, key=lambda x: x[1]))

# then 2 layers: 1-65 nodes in the first layer, 1-9 nodes in the second
two_layers = []
pairs = [(x, y) for x in range(1, 66) for y in range(1, 10)]

for i in pairs:
    classifier = MLPClassifier(solver='lbfgs', max_iter=500, alpha=1e-5, hidden_layer_sizes=i, random_state=1234)
    classifier.fit(X_train_fit2, y_train)
    pred = classifier.predict(X_test_fit2)
    two_layers.append((i, accuracy_score(y_test, pred)))
    #print(i, 'accuracy:\t', accuracy_score(y_test, pred))
print(max(two_layers, key=lambda x: x[1]))


(2, 0.8235294117647058)
((21, 7), 0.8823529411764706)


The best network found here has two hidden layers, with 21 nodes in the first and 7 in the second.

At 0.882 (15 of 17 test papers), this is the highest accuracy of any model tried, so it would be my choice moving forward.
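
Because the topologies in the sweep were compared on the same 17-paper test set that reports the final accuracy, that figure may be optimistic. One possible alternative, sketched here under stated assumptions, is to choose the topology by cross-validation on the training data only. The toy corpus, labels, and three-entry grid below are hypothetical stand-ins so the example runs on its own; in the notebook they would be replaced by `X_train`, `y_train`, and a larger grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train.
toy_texts = [
    "union power commerce revenue", "commerce revenue taxation union",
    "taxation power union commerce", "revenue union power taxation",
    "commerce taxation revenue power", "union commerce taxation revenue",
    "liberty faction republic people", "people republic liberty faction",
    "faction liberty people republic", "republic people faction liberty",
    "liberty republic people faction", "people faction liberty republic",
]
toy_labels = ["A"] * 6 + ["B"] * 6

# Vectorizer and network in one pipeline, so each fold fits tf-idf
# on its own training portion only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mlp", MLPClassifier(solver="lbfgs", alpha=1e-5, max_iter=500, random_state=1234)),
])

# A small illustrative grid of hidden-layer topologies.
param_grid = {"mlp__hidden_layer_sizes": [(2,), (5,), (5, 2)]}

# 3-fold cross-validation on the training data; no test data is touched.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_, search.best_score_)
```

The held-out test set would then be scored once, with the topology selected this way, to get a less biased accuracy estimate.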