# Author Attribution with SKLearn
*The Federalist Papers* is a collection of essays written by Alexander Hamilton, James Madison, and John Jay under the shared pseudonym Publius. They were written to persuade voters to ratify the US Constitution. They remain influential to this day and are frequently cited in federal court rulings, law blogs, and political opinion pieces.

Here, we use *The Federalist Papers* as a means to demonstrate NLP authorship attribution, that is, the attempt to identify the author of a document given samples of the candidate authors' work.

### Setting up

First, I use pandas' built-in read_csv function to read a CSV file from my GitHub.
I also display the first few rows of data.

In [1]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/KaeCan/NLP_Portfolio/main/SKLearn/federalist.csv")
df['author'] = pd.Categorical(df['author'])
df.head()

Unnamed: 0,author,text
0,HAMILTON,FEDERALIST. No. 1 General Introduction For the...
1,JAY,FEDERALIST No. 2 Concerning Dangers from Forei...
2,JAY,FEDERALIST No. 3 The Same Subject Continued (C...
3,JAY,FEDERALIST No. 4 The Same Subject Continued (C...
4,JAY,FEDERALIST No. 5 The Same Subject Continued (C...


We can use pandas to see how many papers each author has written or may have written.

In [2]:
df.groupby(['author'])['author'].count()

author
HAMILTON                49
HAMILTON AND MADISON     3
HAMILTON OR MADISON     11
JAY                      5
MADISON                 15
Name: author, dtype: int64
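
These counts give a useful reference point for the accuracy figures that follow. As a minimal sketch (the `counts` dictionary simply copies the output above, and `majority_baseline` is a name introduced here for illustration), a classifier that always guesses the most common label would be right for about 59% of the full dataset:

```python
# Label counts copied from the groupby output above.
counts = {
    "HAMILTON": 49,
    "HAMILTON AND MADISON": 3,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "MADISON": 15,
}

total = sum(counts.values())                  # 83 papers in total
majority_label = max(counts, key=counts.get)  # the most common label

# Accuracy of always predicting the most common label (illustrative baseline).
majority_baseline = counts[majority_label] / total

print(majority_label, round(majority_baseline, 3))
```

The share of that label inside any particular test split can differ slightly from this whole-dataset figure.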

I use sklearn's train_test_split to divide the data into 80% train and 20% test, and print the shape of each set below.

In [3]:
from sklearn.model_selection import train_test_split

X = df.text
Y = df.author

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, train_size=0.8, random_state=1234)

print(X_train.shape)
print(X_test.shape)


(66,)
(17,)
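
The sizes follow from the 83 papers: the 20% test share is rounded up to 17, leaving 66 for training. One optional variation, not used in this notebook, is to pass `stratify` so that each author label keeps roughly the same share in both sets. The sketch below is only an illustration under assumptions: it uses synthetic labels that mirror the class counts shown earlier, and placeholder integers instead of the real text.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the class counts printed earlier (assumption:
# stand-ins for df.author, used so the example needs no download).
counts = {
    "HAMILTON": 49,
    "HAMILTON AND MADISON": 3,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "MADISON": 15,
}
labels = [author for author, n in counts.items() for _ in range(n)]
documents = list(range(len(labels)))  # placeholder document ids

# stratify=labels keeps class proportions similar in train and test.
docs_train, docs_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.2, random_state=1234, stratify=labels
)

print(len(docs_train), len(docs_test))
print(Counter(y_train))
```

With only three "HAMILTON AND MADISON" papers, a plain random split can leave a rare label out of the training set entirely, which is the situation stratification is meant to avoid.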


We can further refine our text by preprocessing the words. I remove stop words and apply tf-idf vectorization, then print the shapes of the training and test sets again.

In [4]:
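from nltk.corpus import stopwords
# Requires a one-time nltk.download('stopwords').
# Keep the stop words as a list: recent scikit-learn versions reject a set for stop_words.
stopwords = stopwords.words('english')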

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords)

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape)
print(X_test.shape)


(66, 7876)
(17, 7876)
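
Both sets end up with the same 7876 columns because the vocabulary is learned from the training text only and then reused for the test text. The toy sketch below illustrates that behavior. The documents are made up for the example, and it uses scikit-learn's built-in `"english"` stop word list instead of NLTK's so that it runs without extra downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
train_docs = ["the people of the union", "the powers of the government"]
test_docs = ["the government of the people and the senate"]

vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary from the training documents.
X_train_demo = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are ignored.
X_test_demo = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(X_train_demo.shape, X_test_demo.shape)
```

Here "senate" appears only in the test document, so it gets no column. Fitting on the training set alone keeps information about the test text out of the features.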


## Bernoulli Naive Bayes Model
The first model I use is the Bernoulli Naive Bayes model. For each model, I output the prediction accuracy on the test set.

In [5]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

naive_bayes = BernoulliNB()
naive_bayes.fit(X_train, Y_train)

prediction = naive_bayes.predict(X_test)
print('accuracy: ', accuracy_score(Y_test, prediction))

accuracy:  0.5882352941176471


This model gives us a disappointing 59% accuracy. The vocabulary contains 7876 unique words, which may be too many for 66 training documents, and many of those words may not help distinguish the authors. I attempt to rectify this by refining the vectorization to keep only the 1000 most frequent terms and to add bigrams as features.

In [6]:
vectorizer = TfidfVectorizer(
    stop_words=stopwords,
    max_features=1000,
    ngram_range=(1,2)
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, random_state=1234)

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

naive_bayes2 = BernoulliNB()
naive_bayes2.fit(X_train, Y_train)

prediction = naive_bayes2.predict(X_test)
print('accuracy: ', accuracy_score(Y_test, prediction))


accuracy:  0.9411764705882353


As we can see, the refinements to our vectorization raised the accuracy from 59% to 94%.
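
One detail worth keeping in mind: BernoulliNB binarizes its input by default (`binarize=0.0`), so it only sees whether a term is present, not its tf-idf weight. The toy sketch below, using made-up documents and labels, checks this by fitting one model on tf-idf values and another on plain 0/1 indicators, and also confirms that `ngram_range=(1, 2)` adds two-word terms to the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up documents and labels for illustration only.
docs = [
    "the powers of the general government",
    "the powers of the state governments",
    "the dangers of foreign force",
    "the dangers of foreign influence",
]
labels = ["A", "A", "B", "B"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_tfidf = vectorizer.fit_transform(docs)

# Presence/absence version of the same features.
X_binary = (X_tfidf.toarray() > 0).astype(int)

nb_tfidf = BernoulliNB().fit(X_tfidf, labels)
nb_binary = BernoulliNB().fit(X_binary, labels)

# Bigrams show up as vocabulary entries that contain a space.
has_bigrams = any(" " in term for term in vectorizer.vocabulary_)

# Both models learn the same per-class feature probabilities.
same_model = np.allclose(nb_tfidf.feature_log_prob_, nb_binary.feature_log_prob_)

print(has_bigrams, same_model)
```

This suggests the gain above comes from the smaller vocabulary and the added bigrams rather than from the tf-idf weights themselves.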

## Logistic Regression Model
Next, I try to use logistic regression to fit the data. I begin with a baseline model that uses the default parameters.

In [7]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, Y_train)

prediction = log_reg.predict(X_test)
print('accuracy: ', accuracy_score(Y_test, prediction))

accuracy:  0.5882352941176471


Again, our accuracy is disappointing. However, we can adjust parameters to see if we can improve the accuracy.

In [8]:
log_reg2 = LogisticRegression(class_weight='balanced', C=10000)
log_reg2.fit(X_train, Y_train)

prediction = log_reg2.predict(X_test)
print('accuracy: ', accuracy_score(Y_test, prediction))

accuracy:  0.8235294117647058


By raising C (the inverse of regularization strength, so a large value means weaker regularization) and weighting the classes with class_weight='balanced', I was able to improve the model's accuracy from 59% to 82%.
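
To make the class weighting concrete: with `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so rare authors count for more during training. The sketch below applies that formula to the full-dataset label counts shown earlier. This is an assumption made for illustration, since the notebook does not print the counts within the training split, and the actual weights are computed from the training labels.

```python
# Full-dataset label counts from the groupby output (illustrative stand-in
# for the training-split counts, which the notebook does not print).
counts = {
    "HAMILTON": 49,
    "HAMILTON AND MADISON": 3,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "MADISON": 15,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Balanced weighting: n_samples / (n_classes * class_count).
weights = {
    author: n_samples / (n_classes * count) for author, count in counts.items()
}

for author, weight in weights.items():
    print(f"{author}: {weight:.3f}")
```

Under these counts a HAMILTON paper gets a weight well below 1 while the rarest label gets a weight above 5, which counteracts the pull toward always predicting HAMILTON.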

## Neural Networks
Finally, I will attempt to train using a neural network.

In [11]:
from sklearn.neural_network import MLPClassifier

neural = MLPClassifier(hidden_layer_sizes=(10,8), max_iter=10000)
neural.fit(X_train, Y_train)

prediction = neural.predict(X_test)
print('accuracy: ', accuracy_score(Y_test, prediction))

accuracy:  0.8823529411764706


It took a few trials, but by experimenting with different network topologies I obtained a final accuracy of 88%. Because no random_state is set on the MLPClassifier, the exact figure can vary from run to run.
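
When comparing these results, it helps to remember that the test set holds only 17 papers, so each paper is worth about 5.9 percentage points. The short sketch below converts the accuracies printed in this notebook back into counts of correctly attributed papers. The dictionary keys are descriptive labels introduced here, not names from the code.

```python
test_size = 17  # number of papers in the test set

# Accuracies copied from the outputs above.
accuracies = {
    "BernoulliNB (full vocabulary)": 0.5882352941176471,
    "BernoulliNB (1000 features, bigrams)": 0.9411764705882353,
    "LogisticRegression (defaults)": 0.5882352941176471,
    "LogisticRegression (balanced, C=10000)": 0.8235294117647058,
    "MLPClassifier (10, 8)": 0.8823529411764706,
}

# Accuracy times test size gives the number of correct predictions.
correct_counts = {
    name: round(accuracy * test_size) for name, accuracy in accuracies.items()
}

for name, n_correct in correct_counts.items():
    print(f"{name}: {n_correct}/{test_size} correct")
```

The gap between the tuned logistic regression and the neural network is a single paper, so differences of this size should be read with caution.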