# Author Attribution

In this notebook I predict the author of a given text from federalist.csv, a CSV file of texts and their authors. Each text is attributed to Hamilton, Jay, or Madison, or in some cases to a combination of these authors. I will use pandas for the data processing, NLTK for the stop words, and sklearn to fit Bernoulli Naive Bayes, Logistic Regression, and Neural Network models.

## Reading and Processing the Data
### With Pandas

In [1]:
# Importing Pandas
import pandas as pd
# Reading in the csv file with pandas
df = pd.read_csv('federalist.csv')
# Converting the author column to categorical data
df.author = df.author.astype('category')
# Displaying the first few rows of the data frame
df.head()

Unnamed: 0,author,text
0,HAMILTON,FEDERALIST. No. 1 General Introduction For the...
1,JAY,FEDERALIST No. 2 Concerning Dangers from Forei...
2,JAY,FEDERALIST No. 3 The Same Subject Continued (C...
3,JAY,FEDERALIST No. 4 The Same Subject Continued (C...
4,JAY,FEDERALIST No. 5 The Same Subject Continued (C...


### Utilizing sklearn to create train/test data frames

In [2]:
# Import sklearn's train_test_split
from sklearn.model_selection import train_test_split
# Divide into train and test (80/20 with seed 1234 for replicable results)
# X contains the predictor columns and y contains the target column
X_train, X_test, y_train, y_test = train_test_split(
    df[['text']], df[['author']], test_size=0.2, random_state=1234,
    stratify=df[['author']])

# Outputting the dimensions of train and test
print("Dimensions of train data frame: ", X_train.shape)
print("Dimensions of test data frame: ", X_test.shape)

Dimensions of train data frame:  (66, 1)
Dimensions of test data frame:  (17, 1)
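
Because `stratify` is passed to `train_test_split`, each split should keep roughly the same author proportions as the full data frame. The sketch below checks this on a small made-up frame; the `toy` data and its 60/40 author ratio are assumptions for illustration, not values from federalist.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for federalist.csv (10 rows, 2 authors)
toy = pd.DataFrame({
    "author": ["HAMILTON"] * 6 + ["MADISON"] * 4,
    "text": [f"essay {i}" for i in range(10)],
})

# Stratify on the author column so both halves keep the 60/40 ratio
train, test = train_test_split(
    toy, test_size=0.5, random_state=1234, stratify=toy["author"])

# Author proportions in each split
print(train["author"].value_counts(normalize=True))
print(test["author"].value_counts(normalize=True))
```

Running the same `value_counts` check on `y_train` and `y_test` is a quick way to confirm that rare author labels ended up in both splits.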


### Removing stop words

In [3]:
# Importing the nltk stopwords
# (the corpus must be downloaded once with nltk.download('stopwords'))
from nltk.corpus import stopwords
# This is our set of stopwords; it will be used during vectorization.
# The corpus file name is lowercase: 'English' raises OSError on a
# case-sensitive file system.
stopwords = set(stopwords.words('english'))

### Performing tf-idf Vectorization

In [None]:
# Importing our tf-idf vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
# Setting up our stopwords for our vectorizer (stop_words expects a list)
vectorizer = TfidfVectorizer(stop_words=list(stopwords))
# Perform tf-idf vectorization and fit to training data
# (pass the text column, not the data frame, so each row is one document)
X_train_vect = vectorizer.fit_transform(X_train['text'])
# Transforming the test data with the fitted tf-idf vectorization
X_test_vect = vectorizer.transform(X_test['text'])
# Outputting the dimensions of train and test
print("Dimensions of train data frame: ", X_train_vect.shape)
print("Dimensions of test data frame: ", X_test_vect.shape)
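
The vectorizer expects an iterable of strings, so a single text column should be passed rather than a whole data frame. Iterating over a data frame yields its column names, which the vectorizer would treat as the documents. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical) shows the difference in the resulting shapes:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame with a single text column
toy = pd.DataFrame({"text": [
    "the people of the union",
    "the powers of the government",
    "the union and the government",
]})

# Passing the data frame: iteration yields the column name "text",
# so the vectorizer sees one document containing one word
wrong = TfidfVectorizer().fit_transform(toy)

# Passing the column: one row per document, one column per distinct word
right = TfidfVectorizer().fit_transform(toy["text"])

print("data frame:", wrong.shape)
print("text column:", right.shape)
```

If the training matrix reports a single row, this is the most likely cause.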

In [None]:
# Trying Bernoulli Naive Bayes model
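
One way this step might look is sketched below on a tiny made-up corpus. The texts, labels, and variable names are hypothetical; in the notebook the vectorized federalist train and test sets would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data standing in for the federalist texts
train_texts = [
    "the union must defend commerce and revenue",
    "commerce and revenue require an energetic executive",
    "foreign dangers call for a united people",
    "a united people is safer against foreign force",
]
train_labels = ["HAMILTON", "HAMILTON", "JAY", "JAY"]
test_texts = ["revenue and commerce", "foreign force and dangers"]
test_labels = ["HAMILTON", "JAY"]

# Fit the vectorizer on train only, then reuse it for test
vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

# BernoulliNB binarizes features at 0.0 by default, so any nonzero
# tf-idf weight is treated as "word present"
nb = BernoulliNB()
nb.fit(X_tr, train_labels)
pred = nb.predict(X_te)

print("accuracy:", accuracy_score(test_labels, pred))
```

Since Bernoulli Naive Bayes only looks at word presence, the tf-idf weights themselves do not influence this model.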

In [None]:
# Limiting the number of frequent words and adding bigrams
# to improve vectorization for train and test
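
A minimal sketch of this idea uses the `max_features` and `ngram_range` parameters of `TfidfVectorizer`. The three documents and the cap of 5 features are assumptions chosen to keep the example small; a real run would use a much larger cap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the union of states",
    "the union of people",
    "the union and the government",
]

# Keep only the 5 most frequent terms, counting unigrams and bigrams
vec = TfidfVectorizer(max_features=5, ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(X.shape)
print(sorted(vec.vocabulary_))
```

The cap is applied by corpus-wide term frequency, so frequent bigrams such as "the union" can displace rare single words.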

In [None]:
# Trying Bernoulli Naive Bayes model again

In [None]:
# Try Logistic Regression
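
A possible shape for this step, assuming a `Pipeline` that bundles the vectorizer with the classifier so both are fitted on the training texts only. The toy texts and labels are hypothetical, and `max_iter=1000` is an arbitrary choice to give the solver room to converge.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the federalist texts
train_texts = [
    "the union must defend commerce and revenue",
    "commerce and revenue require an energetic executive",
    "foreign dangers call for a united people",
    "a united people is safer against foreign force",
]
train_labels = ["HAMILTON", "HAMILTON", "JAY", "JAY"]
test_texts = ["revenue and commerce", "foreign force and dangers"]

# Vectorizer and classifier are fitted together on the raw training texts
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(train_texts, train_labels)

pred = clf.predict(test_texts)
proba = clf.predict_proba(test_texts)  # one column per author class

print(pred)
print(proba.round(3))
```

Unlike Bernoulli Naive Bayes, logistic regression uses the tf-idf weights directly, so the vectorizer settings matter more here.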

In [None]:
# Try Logistic Regression with _____

In [None]:
# Neural Network 1
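
A sketch of one possible network, assuming sklearn's `MLPClassifier`. The single hidden layer of 8 units, the `lbfgs` solver (often suggested for small data sets), and the toy corpus are all assumptions for illustration, not the notebook's tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data standing in for the federalist texts
train_texts = [
    "the union must defend commerce and revenue",
    "commerce and revenue require an energetic executive",
    "foreign dangers call for a united people",
    "a united people is safer against foreign force",
]
train_labels = ["HAMILTON", "HAMILTON", "JAY", "JAY"]
test_texts = ["revenue and commerce", "foreign force and dangers"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

# One small hidden layer; random_state fixes the weight initialization
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=500, random_state=1234)
mlp.fit(X_tr, train_labels)
pred = mlp.predict(X_te)

print(pred)
```

Later variants could change `hidden_layer_sizes` or the solver while keeping the same seed, so that differences in accuracy come from the architecture rather than the initialization.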

In [None]:
# Neural Network 2

In [None]:
# Neural Network 3