# Introduction 

Authorship attribution is a classic NLP problem and a special case of text classification. It is well studied and gave rise to the field of [Stylometry](https://en.wikipedia.org/wiki/Stylometry).  Given a set of documents from known authors, we train a model to capture each author's style and then use it to identify the author of unknown documents. As with many other NLP problems, authorship attribution has benefited greatly from the increase in available computing power, data and advanced machine learning techniques, which makes it a natural candidate for deep learning (DL).  In particular, we can benefit from DL's ability to automatically extract the relevant features for a specific problem.

In this lab we will focus on the following:
1.  Extracting character-level features from the text of each author (to capture the author's style)
2.  Using these features to build a classification model for authorship attribution
3.  Applying the model to identify the author of a set of unknown documents

As mentioned above, this problem can be solved in three steps. The first is feature extraction. Since we have a limited amount of data, we use characters as our features instead of words or sentences: splitting the texts into words or sentences would leave us with a small dataset, which could make the model hard to train.

### Features

1. A sequence of characters (length of sequence is a hyperparameter)
2. An embedding layer for characters (dimensionality of the embedding is a hyperparameter)

The embedding layer is part of our model, but we can also think of it as a feature extractor, since it maps the raw character indices into a more meaningful, learned vector space.
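
To make the idea concrete, the sketch below mimics what an embedding layer does at lookup time: each character id selects one row of a trainable matrix. The table here is random, and the sizes (`vocab_size`, `embedding_dim`) are illustrative assumptions, not the values used or learned by the model in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5       # hypothetical: distinct characters + 1 for the reserved id 0
embedding_dim = 3    # hypothetical embedding dimensionality

# An embedding layer is essentially a (vocab_size, embedding_dim) lookup table
# whose rows are adjusted during training.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A short sequence of character ids, e.g. from a character-level tokenizer.
char_ids = np.array([1, 4, 2, 2])

# Lookup: every id is replaced by its row, giving (sequence_length, embedding_dim).
embedded = embedding_table[char_ids]
print(embedded.shape)
```

Identical characters always map to the same vector, so any stylistic signal has to come from the order and combination of characters, which is what the RNN layer operates on.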

### Classifier

1. Build a classifier using RNN layers and Dense layers.
2. Choose an optimizer and a learning rate, and train the model on the extracted features.

### Predict

1.  Break the entire document to sequences of the same length, as determined by the hyperparameter
2.  Retrieve an author prediction for each one of these sequences
3.  Determine which author has received more 'votes'.  We will then use this author as our prediction for the entire document.  (Note:  in order to have a clear majority, we need to ensure that the number of sequences is odd).
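
The voting step can be sketched as follows. This is a minimal illustration, assuming the classifier returns one sigmoid probability per sequence with label 0 for author A and label 1 for author B; `majority_vote` is a hypothetical helper, and it reports a tie explicitly instead of relying on an odd number of sequences.

```python
import numpy as np

def majority_vote(probabilities, threshold=0.5):
    """Return (winner, votes_a, votes_b) for per-sequence sigmoid outputs.

    Label 0 is author A and label 1 is author B.
    """
    probs = np.asarray(probabilities).ravel()
    votes_b = int(np.sum(probs > threshold))
    votes_a = int(probs.size - votes_b)
    if votes_a == votes_b:
        return "tie", votes_a, votes_b
    return ("A" if votes_a > votes_b else "B"), votes_a, votes_b

# Five made-up sequence-level predictions for one document.
example = [0.9, 0.2, 0.7, 0.6, 0.1]
print(majority_vote(example))
```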



## Prepare the data

We begin by setting up the data pre-processing pipeline.  For each author, we aggregate all the known papers into a single long text.  We assume that an author's style does not change across papers, so a single long text is equivalent to many small ones while being much easier to handle programmatically.

For each paper of each author we perform the following steps:
1. Convert all text to lower case (ignoring the fact that capitalization may be a stylistic property)
2. Convert all newlines and runs of whitespace into single spaces
3. Remove any mention of the authors' names, otherwise we risk data leakage (the authors' names are hamilton and madison)

Put these steps in a function, since the same preprocessing is needed later for the unknown papers.
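
The three steps can be tried on a plain string before touching any files. The sketch below uses a hypothetical helper, `clean_text`, which works on strings rather than file paths and uses a regular expression for the whitespace step; the sample sentence is made up.

```python
import re

def clean_text(raw, author_names=("hamilton", "madison")):
    """Lower-case, strip author names, and collapse whitespace in a raw string."""
    text = raw.lower()
    for name in author_names:
        text = text.replace(name, "")
    # \s+ matches runs of spaces, tabs and newlines alike.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("To the People\n of  New York:\n  HAMILTON writes."))
```

Lower-casing first matters here: the names are removed with a case-sensitive match, so `HAMILTON` is only caught because the text has already been converted to lower case.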

In [1]:
import numpy as np
import os
from sklearn.model_selection import train_test_split

# Classes for A/B/Unknown
A = 0
B = 1
UNKNOWN = -1


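def preprocess_text(file_path):
    """Read a paper and return its normalized text."""
    with open(file_path, 'r') as f:
        # Drop the first line of the file
        lines = f.readlines()[1:]
    # Lower-case, then remove the authors' names to avoid data leakage
    text = ' '.join(lines).lower().replace('hamilton', '').replace('madison', '')
    # Collapse newlines and runs of whitespace into single spaces
    return ' '.join(text.split())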


# Concatenate all the papers known to be written by A/B into a single long text
all_authorA, all_authorB = '',''
for x in os.listdir('./papers/A/'):
    all_authorA += preprocess_text('./papers/A/' + x)

for x in os.listdir('./papers/B/'):
    all_authorB += preprocess_text('./papers/B/' + x)
    
# Print lengths of the large texts
print("AuthorA text length: {}".format(len(all_authorA)))
print("AuthorB text length: {}".format(len(all_authorB)))

AuthorA text length: 216394
AuthorB text length: 230867


The next step is to break the long text for each author into many small sequences.  As described above, we empirically choose a length for the sequence and use it throughout the model's lifecycle.  We get our full dataset by labeling each sequence with its author.

To break the long texts into smaller sequences we use the *Tokenizer* class from the Keras framework.  In particular, note that we set it up to tokenize according to *characters* and not words.

1. Choose the SEQ_LEN hyperparameter. It may have to be changed if the model doesn't fit the training data well.
2. Write a function make_subsequences to turn each document into sequences of length SEQ_LEN and give each one the correct label.
3. Use the Keras Tokenizer with char_level=True.
4. Fit the tokenizer on all the texts.
5. Use this tokenizer to convert all texts into sequences using texts_to_sequences().
6. Use make_subsequences() to turn these sequences into the appropriate shape and length.
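
Before using Keras, it may help to see the two core ideas in plain Python. The sketch below is a simplified stand-in, not the Keras implementation: `build_char_index` and `sliding_windows` are hypothetical helpers that assign ids by character frequency (starting at 1) and cut a sequence into overlapping windows.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer id, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def sliding_windows(ids, window):
    """Return every contiguous window of `window` ids, stepping one id at a time."""
    return [ids[i:i + window] for i in range(len(ids) - window + 1)]

text = "abcab"
char_index = build_char_index([text])
ids = [char_index[ch] for ch in text]
windows = sliding_windows(ids, window=3)
print(char_index)
print(windows)
```

A text of `n` characters produces `n - window + 1` windows, which is why the number of training sequences ends up almost equal to the number of characters.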

In [2]:
from keras.preprocessing.text import Tokenizer


# Hyperparameter - sequence length to use for the model
SEQ_LEN = 30


def make_subsequences(long_sequence, label, sequence_length=SEQ_LEN):

    len_sequences = len(long_sequence)
    X = np.zeros(((len_sequences - sequence_length)+1, sequence_length))
    y = np.zeros((X.shape[0], 1))
    for i in range(X.shape[0]):
        X[i] = long_sequence[i:i+sequence_length]
        y[i] = label
    return X,y
        
# We use the Tokenizer class from Keras to convert the long texts into a sequence of characters (not words)

tokenizer = Tokenizer(char_level=True)

# Fit on the texts of both authors so that every character gets an index
tokenizer.fit_on_texts([all_authorA, all_authorB])

authorA_long_sequence = tokenizer.texts_to_sequences([all_authorA])[0]
authorB_long_sequence = tokenizer.texts_to_sequences([all_authorB])[0]

# Convert the long sequences into sequence and label pairs
X_authorA, y_authorA = make_subsequences(authorA_long_sequence, A)
X_authorB, y_authorB = make_subsequences(authorB_long_sequence, B)

# Print sizes of available data
print("Number of characters: {}".format(len(tokenizer.word_index)))
print('author A sequences: {}'.format(X_authorA.shape))
print('author B sequences: {}'.format(X_authorB.shape))

Using TensorFlow backend.


Number of characters: 52
author A sequences: (216365, 30)
author B sequences: (230838, 30)


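Compare the number of raw characters to the number of labeled sequences for each author: because the window slides one character at a time, almost every character starts a new training example.  Deep learning requires many examples of each input, which is why we work with characters rather than words.  For comparison, the following code calculates the total and unique number of words in the texts.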

In [3]:
# Calculate the number of unique words in the text

word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts([all_authorA, all_authorB])

print("Total word count: ", len((all_authorA + ' ' + all_authorB).split(' ')))
print("Total number of unique words: ", len(word_tokenizer.word_index))

Total word count:  74349
Total number of unique words:  6318
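
The figures printed so far can be tied together with a little arithmetic. The sketch below only reuses numbers from the outputs above; the variable names are illustrative.

```python
SEQ_LEN = 30

# Figures printed by the cells above.
chars_a, chars_b = 216394, 230867
total_words = 74349

# A stride-1 sliding window over n characters yields n - SEQ_LEN + 1 sequences.
seqs_a = chars_a - SEQ_LEN + 1
seqs_b = chars_b - SEQ_LEN + 1
total_seqs = seqs_a + seqs_b

print(seqs_a, seqs_b, total_seqs)
print("sequences per word: {:.1f}".format(total_seqs / total_words))
```

Character-level windows give roughly six times as many examples as there are words, although consecutive windows overlap heavily and are therefore far from independent.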


We now proceed to create our training and validation sets.

1. Stack the X data together and the y data together
2. Use train_test_split to split the dataset into 80% training and 20% validation
3. Reshape the data to make sure that they are sequences of the correct length

In [4]:
# Stack all sequences from both authors into a single dataset
X = np.vstack((X_authorA, X_authorB))
y = np.vstack((y_authorA, y_authorB))

# Break data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X,y, train_size=0.8)

# Data is to be fed into RNN - ensure that the actual data is of size [batch size, sequence length]
X_train = X_train.reshape(-1, SEQ_LEN)
X_val =  X_val.reshape(-1, SEQ_LEN) 

# Print the shapes of the train and validation sets
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))

print("X_validate shape: {}".format(X_val.shape))
print("y_validate shape: {}".format(y_val.shape))

X_train shape: (357762, 30)
y_train shape: (357762, 1)
X_validate shape: (89441, 30)
y_validate shape: (89441, 1)
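
One caveat with a random split over stride-1 windows is that neighbouring sequences share 29 of their 30 characters, so near-duplicates of validation sequences can end up in the training set and validation accuracy may look optimistic. A possible alternative, sketched here as an assumption rather than as part of this lab, is to split each author's long sequence first and build the windows afterwards. `contiguous_split` and `windows` are hypothetical helpers, and the toy ids stand in for a tokenized text.

```python
import numpy as np

def contiguous_split(long_sequence, train_fraction=0.8):
    """Split one long id sequence into a leading train part and a trailing validation part."""
    cut = int(len(long_sequence) * train_fraction)
    return long_sequence[:cut], long_sequence[cut:]

def windows(ids, window):
    """Stride-1 sliding windows as a 2-D array of shape (n - window + 1, window)."""
    ids = np.asarray(ids)
    return np.stack([ids[i:i + window] for i in range(len(ids) - window + 1)])

toy_ids = list(range(1, 21))   # stand-in for a tokenized text of 20 characters
train_ids, val_ids = contiguous_split(toy_ids)
X_train_toy = windows(train_ids, window=3)
X_val_toy = windows(val_ids, window=3)
print(X_train_toy.shape, X_val_toy.shape)
```

With this split no character position appears in both sets, at the cost of losing the few windows that would straddle the cut.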


Finally, we construct the model and train it.

1. Create a model using RNN and Dense layers
2. Since it's a binary classification problem, the output layer should be a Dense layer with sigmoid activation
3. Compile the model with an optimizer, an appropriate loss function and metrics
4. Print the summary of the model

In [5]:
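from keras.layers import SimpleRNN, Embedding, Dense
from keras.models import Sequential
from keras.optimizers import SGD, Adadelta, Adam

# Hyperparameters - embedding dimensionality and number of RNN units
Embedding_size = 100
RNN_size = 256

model = Sequential()
# +1 because Tokenizer indices start at 1; index 0 is reserved
model.add(Embedding(len(tokenizer.word_index)+1, Embedding_size, input_length=SEQ_LEN))
# return_sequences=False keeps only the final hidden state
model.add(SimpleRNN(RNN_size, return_sequences=False))
# Single sigmoid unit: output near 0 means author A, near 1 means author B
model.add(Dense(1, activation='sigmoid'))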

model.compile(optimizer='adam', loss='binary_crossentropy', metrics = ['accuracy'])
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 30, 100)           5300      
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 256)               91392     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 96,949
Trainable params: 96,949
Non-trainable params: 0
_________________________________________________________________
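
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check when changing the layer sizes. The sketch below uses only the sizes shown above; the variable names are illustrative.

```python
vocab_size = 52 + 1      # 52 characters plus the reserved index 0
embedding_dim = 100
rnn_units = 256

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input weights + recurrent weights + one bias per unit.
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense(1): one weight per RNN unit plus a single bias.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```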


1. Decide upon the batch size and number of epochs, then train the model on the training data and validate it with the validation data
2. Based on the results, go back to the model above and change it if needed (use more layers, regularization, dropout, a different optimizer, a different learning rate, etc.)
3. Change the batch size and number of epochs if needed
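
When choosing these values, it can help to translate them into the number of weight updates the model will receive. The sketch below assumes the training-set size of 357762 sequences reported earlier and, as example values, a batch size of 4096 and 20 epochs.

```python
import math

train_samples = 357762
batch_size = 4096
epochs = 20

# Each epoch processes the training set once, one batch per weight update.
steps_per_epoch = math.ceil(train_samples / batch_size)
# The final batch of an epoch holds whatever is left over.
last_batch = train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```

A larger batch size means fewer, smoother updates per epoch, so it usually has to be tuned together with the learning rate and the number of epochs.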

In [6]:
Batch_size = 4096
Epochs = 20
model.fit(X_train, y_train, batch_size=Batch_size, epochs=Epochs, validation_data=(X_val, y_val))

Train on 357762 samples, validate on 89441 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x20f3a8d9ef0>

### Applying the Model to the Unknown Papers

Do this for all the papers in the Unknown folder:
1. Preprocess them the same way as the training set (lower case, collapsing whitespace, etc.)
2. Use the tokenizer and the make_subsequences function above to turn them into sequences of the required size
3. Use the model to predict on these sequences
4. Count the number of sequences assigned to author A and the number assigned to author B
5. Based on the counts, pick the author with the most votes

In [7]:
for x in sorted(os.listdir('./papers/Unknown/')):
    unknown = preprocess_text('./papers/Unknown/' + x)
    unknown_long_sequences = tokenizer.texts_to_sequences([unknown])[0]
    X_sequences, _ = make_subsequences(unknown_long_sequences, UNKNOWN)
    X_sequences = X_sequences.reshape((-1,SEQ_LEN))
    
    # Sigmoid outputs above 0.5 count as a vote for author B (label 1),
    # the rest as a vote for author A (label 0)
    predictions = model.predict(X_sequences) > 0.5
    votes_for_authorA = np.sum(predictions == 0)
    votes_for_authorB = np.sum(predictions == 1)
    
    print("Paper {} is predicted to have been written by {}, {} to {}".format(
                x.replace('paper_','').replace('.txt',''), 
                ("Author A" if votes_for_authorA > votes_for_authorB else "Author B"),
                max(votes_for_authorA, votes_for_authorB), min(votes_for_authorA, votes_for_authorB)))
    

Paper 1 is predicted to have been written by Author B, 11946 to 8828
Paper 2 is predicted to have been written by Author B, 11267 to 8379
Paper 3 is predicted to have been written by Author B, 6738 to 6646
Paper 4 is predicted to have been written by Author A, 5254 to 4519
Paper 5 is predicted to have been written by Author A, 6570 to 5184


# Summary

In this lab, we discussed the problem of authorship attribution.  We extracted character-level features from each author's text, trained an RNN classifier on them, and attributed each unknown paper by majority vote over its sequences.

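The first two papers were written by author B, and the next three by author A. Compared with the predictions above, the model gets papers 1, 2, 4 and 5 right and misses paper 3 by a narrow margin (6738 to 6646 votes).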

The model was able to capture the style of each author from the character sequences given to it. The sequence length is a hyperparameter that needs to be tuned, so experiment with it as part of the feature extraction stage.

You are now able to train a model to solve authorship attribution.

Good luck with your next lessons! Try this assignment again with the layers you learn next and see whether the model improves.