## Project Overview

My goal is to build a classifier that predicts the author of a given book. The key steps in building the classifier are:

1. Read in all novels (6 in total) and identify the words that are among the most frequently used in every novel as top_words_set.
2. Read in a novel and break it up into chunks of the same length.
3. For each chunk, count the frequencies of the words that belong to top_words_set.
4. Create a data frame in which each row corresponds to a chunk, the features are the counts, and the label is author_id.
5. Perform steps 2-4 for each novel, and stack the data frames into one data frame that contains all novels (6 in total).
6. Create train and test sets from the data frame.
7. Use a classifier (Random Forest chosen) to learn from the train set and calculate the accuracy score on the test set.
8. Read in a new novel (different from the 6 novels above) and convert it to a data frame using the same frequency-counting technique.
9. Predict the author of the new novel.
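
Steps 2 and 3 can be illustrated on a toy text. This is only a sketch: the sentence, the three-word `top_words_set`, and the chunk size of 5 are made up for illustration (the notebook itself uses 2000-word chunks).

```python
import re

# Toy stand-ins, not the notebook's real data.
text = "the cat and the dog and the bird saw a cat"
top_words_set = {"the", "and", "cat"}
chunk_size = 5  # hypothetical; the notebook uses 2000

# Step 2: split into words, then into consecutive chunks of chunk_size words.
words = re.split(r"\W+", text)
chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Step 3: for each chunk, count only the words that belong to top_words_set.
rows = []
for chunk in chunks:
    counts = {word: 0 for word in sorted(top_words_set)}
    for word in chunk:
        if word in counts:
            counts[word] += 1
    rows.append(counts)

for row in rows:
    print(row)
```

Each dictionary corresponds to one row of the data frame in step 4. Note that the last chunk is shorter than the others whenever the word count is not a multiple of the chunk size.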


## Code

In [122]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
from scipy.stats import mode

## Define functions that create the features (word-frequency counts)

In [None]:
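# Module-level settings used by the functions below.
top_num = 200      # number of most frequent words kept per novel
chunk_size = 2000  # number of words per chunk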


def find_top_words_freq(words_stream, top_words_set):
    stream_freq = dict([(top_word,0) for top_word in top_words_set])
    for word in words_stream:
        if word in top_words_set:
            stream_freq[word] += 1
    return stream_freq

def define_top_words_set(novel):
    # Count how often each word occurs in the novel.
    word_freq = {}
    words = re.split(r'\W+', novel)
    
    for word in words:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1

    # Sort by descending count and keep the top_num most frequent words.
    word_freq_list_sorted = sorted(word_freq.items(), key=lambda x: -x[1])
    word_freq_list_top = word_freq_list_sorted[:top_num]
    top_words_set = set(word for word, _ in word_freq_list_top)
    
    return top_words_set

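def create_freq(novel, top_words_set):
    # Split the novel into words, then into consecutive chunks of chunk_size words
    # (the final chunk may be shorter).
    words = re.split(r'\W+', novel)
    words_collection = [words[i:i+chunk_size] for i in range(0, len(words), chunk_size)]

    # Build one row of counts per chunk and stack the rows into a single data frame.
    freq_dfs = [pd.DataFrame(find_top_words_freq(collection, top_words_set), index=[0]) for collection in words_collection]
    freq = pd.concat(freq_dfs, ignore_index=True)

    return freq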
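
As an aside, the counting and sorting in `define_top_words_set` can also be written with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the code used in this notebook. `top_words_counter` is a hypothetical name, and unlike the function above it drops the empty strings that `re.split` can produce at the start or end of a text.

```python
import re
from collections import Counter


def top_words_counter(novel, top_num):
    """Return the set of the top_num most frequent words in novel."""
    # Drop empty strings that re.split yields at the start or end of the text.
    words = [w for w in re.split(r"\W+", novel) if w]
    return {word for word, _ in Counter(words).most_common(top_num)}


sample = "a b a c a b d"
print(top_words_counter(sample, 2))
```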

## Create feature data frames

In [None]:
with open('./author_id/DavidHerbertLawrence_SonsandLovers.txt') as f:
    novel1 = f.read()
    
with open('./author_id/GeorgeMacDonald_AttheBackoftheNorthWind.txt') as f:
    novel2 = f.read()
    
with open('./author_id/HenryJames_TheAspernPapers.txt') as f:
    novel3 = f.read()
    
with open('./author_id/JosephConrad_HeartofDarkness.txt') as f:
    novel4 = f.read()

with open('./author_id/LewisCarroll_ThroughtheLooking-Glass.txt') as f:
    novel5 = f.read()
    
with open('./author_id/MarkTwain_AdventuresofHuckleberryFinn.txt') as f:
    novel6 = f.read()
    

top_words_set1 = define_top_words_set(novel1)
top_words_set2 = define_top_words_set(novel2)
top_words_set3 = define_top_words_set(novel3)
top_words_set4 = define_top_words_set(novel4)
top_words_set5 = define_top_words_set(novel5)
top_words_set6 = define_top_words_set(novel6)
top_words_set_all = set.intersection(top_words_set1, top_words_set2, top_words_set3, top_words_set4, top_words_set5, top_words_set6)
top_words_set = top_words_set_all

freq1 = create_freq(novel1, top_words_set)
freq1['author_id'] = 1


freq2 = create_freq(novel2, top_words_set)
freq2['author_id'] = 2
freq2.head()


freq3 = create_freq(novel3, top_words_set)
freq3['author_id'] = 3
freq3.head()


freq4 = create_freq(novel4, top_words_set)
freq4['author_id'] = 4
freq4.head()


freq5 = create_freq(novel5, top_words_set)
freq5['author_id'] = 5
freq5.head()


freq6 = create_freq(novel6, top_words_set)
freq6['author_id'] = 6
freq6.head()

author_dict = {1:'David Herbert Lawrence', 2:'George MacDonald', 3:'Henry James', 4:'Joseph Conrad', 5:'Lewis Carroll', 6:'Mark Twain'}
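
The six near-identical blocks above follow one pattern, so they could be expressed as a loop. The sketch below shows that pattern on toy strings, with plain dictionaries instead of data frames. `texts`, `top_words`, and `count_chunks` are hypothetical names, and the tiny `top_num` and `chunk_size` values are chosen only so the result is easy to check by hand.

```python
import re
from collections import Counter

# Toy stand-ins for the novels, keyed by author_id.
texts = {
    1: "the sea and the ship and the sea",
    2: "the hill and the tree and a hill",
}
top_num = 3      # hypothetical; the notebook uses 200
chunk_size = 4   # hypothetical; the notebook uses 2000


def top_words(text, n):
    """Set of the n most frequent words in text."""
    words = [w for w in re.split(r"\W+", text) if w]
    return {w for w, _ in Counter(words).most_common(n)}


def count_chunks(text, vocab, size):
    """One dict of vocab-word counts per chunk of `size` words."""
    words = [w for w in re.split(r"\W+", text) if w]
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [{v: chunk.count(v) for v in sorted(vocab)} for chunk in chunks]


# Keep only the words that are in the top list of every text.
shared = set.intersection(*(top_words(t, top_num) for t in texts.values()))

# One labelled row per chunk, stacked across all texts.
rows = []
for author_id, text in texts.items():
    for counts in count_chunks(text, shared, chunk_size):
        counts["author_id"] = author_id
        rows.append(counts)

print(sorted(shared))
print(len(rows))
```

In this toy case the topic words ("sea", "hill") drop out of the intersection, leaving only words that are common to both texts.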

## Create train and test sets

In [200]:
freq_all = pd.concat([freq1,freq2,freq3,freq4,freq5,freq6], ignore_index=True)

# Split each novel separately (70% train, 30% test) so every author appears in both sets.
splits = [train_test_split(freqX, test_size=0.3, random_state=42) for freqX in [freq1,freq2,freq3,freq4,freq5,freq6]]

df_train = pd.concat([train for train, test in splits])

df_test = pd.concat([test for train, test in splits])

X_train = np.array(df_train.iloc[:, 0:len(top_words_set)])

X_test = np.array(df_test.iloc[:, 0:len(top_words_set)])

y_train = np.array(df_train.iloc[:, len(top_words_set)])

y_test = np.array(df_test.iloc[:, len(top_words_set)])
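
Splitting each novel separately, as above, means every author contributes roughly 70% of their chunks to training and 30% to testing, regardless of how long each novel is. The following sketch illustrates that idea with the standard library only. `split_per_author` is a hypothetical helper, not the `train_test_split` call used in the notebook, and the chunk labels are made up.

```python
import random


def split_per_author(chunks_by_author, test_size=0.3, seed=42):
    """Shuffle each author's chunks and split them into train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for author_id, chunks in chunks_by_author.items():
        shuffled = list(chunks)
        rng.shuffle(shuffled)
        # At least one test chunk per author.
        n_test = max(1, round(len(shuffled) * test_size))
        test += [(author_id, c) for c in shuffled[:n_test]]
        train += [(author_id, c) for c in shuffled[n_test:]]
    return train, test


# Made-up chunk labels: author 1 has 10 chunks, author 2 has 20.
chunks_by_author = {
    1: ["a%d" % i for i in range(10)],
    2: ["b%d" % i for i in range(20)],
}
train, test = split_per_author(chunks_by_author)
print(len(train), len(test))
```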

## Classifiers

In [214]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X_train, y_train)
accuracy_score(y_test, rfc.predict(X_test))

0.93243243243243246

In [223]:
with open('./author_id/HenryJames_TheTurnoftheScrew.txt') as f:
    novelTest = f.read()
freqTest = create_freq(novelTest, top_words_set)

# Majority vote: predict an author for every chunk and take the most common label.
author_dict[pd.Series(rfc.predict(freqTest)).mode()[0]]

'Henry James'

In [222]:
with open('./author_id/MarkTwain_TheAdventuresofTomSawyer.txt') as f:
    novelTest = f.read()
freqTest = create_freq(novelTest, top_words_set)

# Majority vote: predict an author for every chunk and take the most common label.
author_dict[pd.Series(rfc.predict(freqTest)).mode()[0]]

'Mark Twain'
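
Because `rfc.predict` returns one label per chunk, the final answer for a novel is the most common label across its chunks. A minimal sketch of that majority vote follows, using `collections.Counter`. The `chunk_predictions` list is invented for illustration and `majority_author` is a hypothetical helper.

```python
from collections import Counter

author_dict = {1: 'David Herbert Lawrence', 2: 'George MacDonald', 3: 'Henry James',
               4: 'Joseph Conrad', 5: 'Lewis Carroll', 6: 'Mark Twain'}


def majority_author(chunk_predictions):
    """Return (author name, share of chunks that voted for that author)."""
    label, votes = Counter(chunk_predictions).most_common(1)[0]
    return author_dict[label], votes / len(chunk_predictions)


# Invented per-chunk predictions for one novel.
chunk_predictions = [6, 6, 4, 6, 1, 6, 6, 2]
name, share = majority_author(chunk_predictions)
print(name, share)
```

Reporting the share alongside the name gives a rough sense of how decisive the vote was, which the single label alone does not show.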

## Summary

The classifier reaches a reasonable accuracy score (about 93% on the held-out chunks) given its simplicity. More importantly, it correctly identified the authors of two other novels that were not used in training at all, which suggests that a writer's style stays similar across their books.

With more time, tuning chunk_size and the number of top words could potentially improve the accuracy score. More advanced features such as N-grams could help as well.
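
As a rough idea of what N-gram features could look like, the sketch below counts word bigrams (N = 2) in a piece of text. It is not part of the notebook. `count_ngrams` is a hypothetical helper and the sentence is a toy input.

```python
import re
from collections import Counter


def count_ngrams(text, n=2):
    """Count runs of n consecutive words in text."""
    words = [w for w in re.split(r"\W+", text) if w]
    # Zip the word list against shifted copies of itself to form n-word tuples.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(ngrams)


bigrams = count_ngrams("it was the best of times it was the worst of times")
print(bigrams.most_common(3))
```

The most frequent N-grams across all novels could then play the same role as top_words_set, with one count column per N-gram.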