This project predicts the author of a short text excerpt.
Lines of text from works by Jane Austen and Arthur Conan Doyle are used to train a Multinomial Naive Bayes classifier, which then predicts the author of unseen lines.

In [1]:
import numpy as np
import pandas as pd

#### Importing datasets and Data preprocessing -

Importing Doyle text file -

In [2]:
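# 'delimiter' is a separator string that does not split these lines,
# so every line of the file is read as one single-column row.
data_doyle = pd.read_csv('Doyle.txt', header=None, sep='delimiter', encoding='unicode_escape', engine='python')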
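
The `unicode_escape` codec decodes ordinary bytes as Latin-1, which is why a UTF-8 byte-order mark at the start of the file shows up as `ï»¿` in row 0 of the output. As a hedged alternative, the sketch below reads a file line by line with the `utf-8-sig` codec, which drops the mark. The helper name `read_lines_as_frame`, the whitespace stripping, and the temporary sample file are assumptions for illustration, and the sketch assumes the source files are UTF-8 encoded.

```python
import os
import tempfile

import pandas as pd


def read_lines_as_frame(path, author):
    """Read a plain-text file into a DataFrame with one row per non-blank line."""
    # 'utf-8-sig' decodes UTF-8 and silently drops a leading byte-order mark.
    with open(path, encoding="utf-8-sig") as handle:
        lines = [line.strip() for line in handle]
    # Drop blank lines so that every row carries some text.
    lines = [line for line in lines if line]
    return pd.DataFrame({"Content": lines, "Author": author})


# Demonstration on a small temporary file that starts with a byte-order mark.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffThe Project Gutenberg eBook\n\nChapter one\n".encode("utf-8"))
    sample_path = tmp.name

sample = read_lines_as_frame(sample_path, "Conan Doyle")
os.remove(sample_path)
print(sample)
```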

In [3]:
data_doyle.columns = ['Content']

In [4]:
data_doyle['Author'] = 'Conan Doyle'

In [5]:
data_doyle

,Content,Author
0,ï»¿The Project Gutenberg eBook of The Adventur...,Conan Doyle
1,This eBook is for the use of anyone anywhere i...,Conan Doyle
2,most other parts of the world at no cost and w...,Conan Doyle
3,"whatsoever. You may copy it, give it away or r...",Conan Doyle
4,of the Project Gutenberg License included with...,Conan Doyle
...,...,...
9621,facility: www.gutenberg.org,Conan Doyle
9622,This website includes information about Projec...,Conan Doyle
9623,including how to make donations to the Project...,Conan Doyle
9624,"Archive Foundation, how to help produce our ne...",Conan Doyle


Importing Austen text file -

In [3]:
data_austen = pd.read_csv('Austen.txt',header=None,sep='delimiter',encoding = 'unicode_escape', engine='python')

In [7]:
data_austen.columns = ['Content']

data_austen['Author'] = 'Jane Austen'

data_austen

,Content,Author
0,Project Gutenberg's The Complete Works of Jane...,Jane Austen
1,This eBook is for the use of anyone anywhere a...,Jane Austen
2,almost no restrictions whatsoever. You may co...,Jane Austen
3,re-use it under the terms of the Project Guten...,Jane Austen
4,with this eBook or online at www.gutenberg.org,Jane Austen
...,...,...
67872,http://www.gutenberg.org,Jane Austen
67873,This Web site includes information about Proje...,Jane Austen
67874,including how to make donations to the Project...,Jane Austen
67875,"Archive Foundation, how to help produce our ne...",Jane Austen


To keep the two classes balanced, we truncate the Austen data to the same number of rows as the Doyle data (9626 instances each).

In [8]:
data_austen = data_austen.iloc[:9626,:]

In [9]:
data_doyle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9626 entries, 0 to 9625
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Content  9626 non-null   object
 1   Author   9626 non-null   object
dtypes: object(2)
memory usage: 150.5+ KB


In [10]:
data_austen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9626 entries, 0 to 9625
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Content  9626 non-null   object
 1   Author   9626 non-null   object
dtypes: object(2)
memory usage: 150.5+ KB
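
Taking the first 9626 rows keeps only the opening part of the Austen file. As one possible alternative, the sketch below balances two frames by drawing a random sample from the larger one. The toy frames `small` and `large` are made-up stand-ins for the real data, and the choice of `random_state=1` is an assumption.

```python
import pandas as pd

# Toy stand-ins for the two frames; the real ones hold one text line per row.
small = pd.DataFrame({"Content": [f"doyle line {i}" for i in range(5)],
                      "Author": "Conan Doyle"})
large = pd.DataFrame({"Content": [f"austen line {i}" for i in range(40)],
                      "Author": "Jane Austen"})

# Draw as many rows from the larger frame as the smaller one has.
# Sampling is without replacement; random_state makes the draw reproducible.
balanced_large = large.sample(n=len(small), random_state=1).reset_index(drop=True)

print(len(small), len(balanced_large))
```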


Now that both dataframes have the same number of instances, let's combine them to form our dataset -

In [12]:
frames = [data_doyle,data_austen]
data_combined = pd.concat(frames)
data_combined

,Content,Author
0,ï»¿The Project Gutenberg eBook of The Adventur...,Conan Doyle
1,This eBook is for the use of anyone anywhere i...,Conan Doyle
2,most other parts of the world at no cost and w...,Conan Doyle
3,"whatsoever. You may copy it, give it away or r...",Conan Doyle
4,of the Project Gutenberg License included with...,Conan Doyle
...,...,...
9621,family whom she need now fear to meet. The eve...,Jane Austen
9622,"more, for her than could have been expected.",Jane Austen
9623,CHAPTER 13,Jane Austen
9624,"Monday, Tuesday, Wednesday, Thursday, Friday, ...",Jane Austen


The index is no longer unique: each dataframe kept its own 0-based index, so the Austen rows restart from 0 instead of continuing after the Doyle rows. Let's fix that -

In [13]:
data_combined = data_combined.reset_index(drop=True)
data_combined

,Content,Author
0,ï»¿The Project Gutenberg eBook of The Adventur...,Conan Doyle
1,This eBook is for the use of anyone anywhere i...,Conan Doyle
2,most other parts of the world at no cost and w...,Conan Doyle
3,"whatsoever. You may copy it, give it away or r...",Conan Doyle
4,of the Project Gutenberg License included with...,Conan Doyle
...,...,...
19247,family whom she need now fear to meet. The eve...,Jane Austen
19248,"more, for her than could have been expected.",Jane Austen
19249,CHAPTER 13,Jane Austen
19250,"Monday, Tuesday, Wednesday, Thursday, Friday, ...",Jane Austen
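
The concatenation and the index reset can also be done in one step. The sketch below uses the `ignore_index=True` argument of `pd.concat` on two toy frames (`left` and `right` are made-up names and data for illustration).

```python
import pandas as pd

# Two small frames, each with its own 0-based index.
left = pd.DataFrame({"Content": ["a", "b"], "Author": "Conan Doyle"})
right = pd.DataFrame({"Content": ["c", "d"], "Author": "Jane Austen"})

# ignore_index=True discards the original indices and numbers the rows 0..n-1.
combined = pd.concat([left, right], ignore_index=True)

print(combined.index.tolist())
```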


Let's add a numeric label to our data: 0 if the author is Doyle and 1 if the author is Austen -

In [14]:
data_combined['label'] = data_combined['Author'].apply(lambda x: 0 if x=='Conan Doyle' else 1)
data_combined

,Content,Author,label
0,ï»¿The Project Gutenberg eBook of The Adventur...,Conan Doyle,0
1,This eBook is for the use of anyone anywhere i...,Conan Doyle,0
2,most other parts of the world at no cost and w...,Conan Doyle,0
3,"whatsoever. You may copy it, give it away or r...",Conan Doyle,0
4,of the Project Gutenberg License included with...,Conan Doyle,0
...,...,...,...
19247,family whom she need now fear to meet. The eve...,Jane Austen,1
19248,"more, for her than could have been expected.",Jane Austen,1
19249,CHAPTER 13,Jane Austen,1
19250,"Monday, Tuesday, Wednesday, Thursday, Friday, ...",Jane Austen,1


#### Training and Test sets -

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_combined['Content'], data_combined['label'], random_state=1)
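
Since no `test_size` is passed, `train_test_split` uses its default of 25% for the test set, which gives 4813 of the 19252 rows here. The split above is not stratified, so the two classes end up slightly uneven in the test set. A minimal sketch of a stratified split on made-up toy data (the frame `toy` and the variable names are assumptions) could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 40 rows, 20 per class.
toy = pd.DataFrame({"Content": [f"line {i}" for i in range(40)],
                    "label": [0] * 20 + [1] * 20})

# stratify keeps the class proportions identical in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Content"], toy["label"],
    test_size=0.25, stratify=toy["label"], random_state=1,
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```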

### Count Vectorization and Model Building

Converting the text content into word-count vectors. This way, the Naive Bayes classifier can quantify how often a particular word occurs in each category -

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', lowercase=True, stop_words='english')
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
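
To make the word-count representation concrete, here is a small sketch on two made-up sentences. It uses a plain `CountVectorizer` with English stop words rather than the exact settings above, and the names `docs`, `toy_cv` and `counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences standing in for lines of text.
docs = ["The detective examined the letter",
        "The letter reached her sister"]

toy_cv = CountVectorizer(lowercase=True, stop_words="english")
counts = toy_cv.fit_transform(docs)  # sparse matrix: one row per document

# Columns are ordered alphabetically by term; stop words are dropped.
vocabulary = sorted(toy_cv.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

Each row of the matrix counts how often each vocabulary term appears in the corresponding sentence. Calling `transform` on new text reuses the fitted vocabulary, so words not seen during fitting are ignored.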

In [17]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, y_train)
predictions = naive_bayes.predict(X_test_cv)
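
To predict the author of new text, the vectorizer and the classifier have to be applied in the same order as during training. One way to keep them together is a scikit-learn pipeline. The sketch below is a minimal illustration on made-up sentences, not the notebook's trained model; `author_model`, `names` and the training texts are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences: label 0 = Conan Doyle, label 1 = Jane Austen.
train_texts = [
    "Holmes lit his pipe and studied the footprint",
    "Watson followed Holmes into the foggy street",
    "Elizabeth danced with Darcy at the ball",
    "Emma walked to the parsonage with Elizabeth",
]
train_labels = [0, 0, 1, 1]

# The pipeline vectorizes the text and then fits the classifier.
author_model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
author_model.fit(train_texts, train_labels)

names = {0: "Conan Doyle", 1: "Jane Austen"}
new_texts = ["Holmes examined the footprint", "Darcy spoke to Elizabeth"]
predicted = [names[int(label)] for label in author_model.predict(new_texts)]
print(predicted)
```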

### Evaluation

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('Confusion Matrix:\n', confusion_matrix(y_test,predictions))

Accuracy score:  0.8599626012881778
Precision score:  0.8373565492679066
Recall score:  0.8894493484657419
Confusion Matrix:
 [[2023  411]
 [ 263 2116]]
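
The scores above follow directly from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so with Austen (label 1) as the positive class the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below recomputes the metrics by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true label, columns = predicted label.
cm = np.array([[2023, 411],
               [263, 2116]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all test lines classified correctly
precision = tp / (tp + fp)           # of lines predicted as Austen, share that are Austen
recall = tp / (tp + fn)              # of true Austen lines, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```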


The model reaches an accuracy of about 86% on the held-out test set when predicting the author of a line taken from their written works.

In [19]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      0.83      0.86      2434
           1       0.84      0.89      0.86      2379

    accuracy                           0.86      4813
   macro avg       0.86      0.86      0.86      4813
weighted avg       0.86      0.86      0.86      4813

