# Author Identification from the Text


This dataset is taken from Kaggle competetion on Spooky Author Identification. It contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenize. The objective is to accurately identify the author of the sentences in the training Dataset and Validate with the splitted data as test data.

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [4]:
from sklearn.metrics import  accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [5]:
au_train=pd.read_csv("C:\\python\\study\\dataset\\text mining_personalised medicine\\authhor classification\\all\\train.csv")

In [6]:
au_test=pd.read_csv("C:\\python\\study\\dataset\\text mining_personalised medicine\\authhor classification\\all\\test.csv")

Let's try to explore our Data

In [7]:
au_train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [8]:
au_train.shape

(19579, 3)

We have 19579 rows of labled Text. Let's explore Let's explore the Text columns more.

In [9]:
au_train.text[0]

'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'

In [10]:
au_train.text.map(lambda x:len(x.split())).head()# find number of words in first few rows

0    41
1    14
2    36
3    34
4    27
Name: text, dtype: int64

In [11]:
au_train.text.map(lambda x:len(x.split())).mean()

26.730476530977068

So it seems all the rows consists of small sentences from different works of the three authors.

In [12]:
cv=CountVectorizer()

In [13]:
# converting The text to a Matrix with the unique words as columns, with each rows has a number for frequency of occurences
au_train_vect=cv.fit_transform(au_train.text) 

In [14]:
au_train_vect.shape

(19579, 25068)

In [15]:
# Creating Data and labels
Y=au_train.author
X=au_train_vect

In [16]:
Y.shape,X.shape

((19579,), (19579, 25068))

In [17]:
xtrain,xtest,ytrain,ytest=train_test_split(X,Y)

In [18]:
xtrain.shape,xtest.shape,ytrain.shape,ytest.shape

((14684, 25068), (4895, 25068), (14684,), (4895,))

We will try to create a simple model using Random Forest and then Neural Networks and check the accuracy

In [19]:
model_RF=RandomForestClassifier(n_estimators=100)

In [20]:
model_RF.fit(xtrain,ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [21]:
accuracy_score(model_RF.predict(xtrain),ytrain)

1.0

In [22]:
accuracy_score(model_RF.predict(xtest),ytest)

0.6964249233912155

Let's try using NeuralNetwork

In [23]:
model_NN=MLPClassifier()

In [24]:
model_NN.fit(xtrain,ytrain)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [25]:
accuracy_score(model_NN.predict(xtrain),ytrain)

1.0

In [26]:
accuracy_score(model_NN.predict(xtest),ytest)

0.7942798774259449

Let us try using TfidfVectorizer. TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

In [35]:
tfv=TfidfVectorizer()

In [36]:
au_tfv=tfv.fit_transform(au_train.text)

In [37]:
xtrain_tfv,xtest_tfv,ytrain_tfv,ytest_tfv=train_test_split(X,Y)

In [38]:
model_RF_tfv=RandomForestClassifier(n_estimators=100)

In [39]:
model_RF_tfv.fit(xtrain_tfv,ytrain_tfv)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [40]:
accuracy_score(model_RF_tfv.predict(xtrain_tfv),ytrain_tfv)

1.0

In [41]:
accuracy_score(model_RF_tfv.predict(xtest_tfv),ytest_tfv)

0.6905005107252298

In [42]:
model_NN_tfv=MLPClassifier()

In [43]:
model_NN_tfv.fit(xtrain_tfv,ytrain_tfv)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [44]:
accuracy_score(model_NN_tfv.predict(xtrain_tfv),ytrain_tfv)

0.9984336692999183

In [45]:
accuracy_score(model_NN_tfv.predict(xtest_tfv),ytest_tfv)

0.8241062308478039

Thus we find Vectorising the texts using TfidfVectorizer gives best results specially with Neural Networks.