<a href="https://colab.research.google.com/github/KalikaKay/Author-Classification-Project/blob/master/Bag_of_Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag of Words 
Using Bag of Words - apply supervised models with GridSearchCV:
* Naive Bayes
* Logistic Regression
* Decision Tree
* Random Forest
* KNN
* SVM 
* Gradient Boosting 

# Data Cleaning



In [3]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from gensim.summarization import keywords
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import datetime as dt
from sklearn.metrics import classification_report
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
import numpy as np

# file location
PATH = '/content/drive/MyDrive/Author Classification/AuthorTexts'
DOC_PATTERN = r'.*\.txt'

corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [4]:
#Dataframe of sentences and authors.
for fileid in corpus.fileids():
  book = pd.DataFrame(corpus.raw(fileids=fileid).split('\n'), columns=['sentence'])
  book['author'] = fileid.split('/')[0]
  try:
    books = pd.concat([books, book])
  except NameError: 
    books = book

In [5]:
#Remove empty strings.
books = books.replace(r'^\s*$', np.NaN, regex=True)
books.dropna(inplace=True)
books.reindex()

#Remove contents and Chapter Titles.
i = 0
for sent in books.sentence:
  if any(c.islower() for c in sent) == False:
    books.iloc[i] = np.NaN
  if 'project gutenberg' in sent.lower():
    books.iloc[i] = np.NaN    
  if 'contents' == sent.lower().strip():
    books.iloc[i] == np.NaN
  i += 1

books.dropna(inplace=True)
books.reindex()
#drop the content headers.
content_index = books[books.sentence == 'Contents '].index.values
books.drop(content_index, inplace=True)

# Feature Engineering

These features will be fed into a pipeline for our models. 

Due to memory constraints, I've opted to visualize the vectors in separate notebooks. 

In [6]:

#Tokenize the data
tokenizer = RegexpTokenizer(r'\w+')
books['tokenized'] = [tokenizer.tokenize(sent.lower()) for sent in books.sentence 
                   if tokenizer.tokenize(sent.lower()) not in stopwords.words('english')]

In [7]:
# Lemmatize the tokens. 
lemmatizer = WordNetLemmatizer()
lemmatized = []
for token in books.tokenized:
  lemmatized.append([lemmatizer.lemmatize(word) for word in token])
books['lemmatized'] = lemmatized

#Declare the variables. 
X = books['lemmatized']
y = books['author']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train.astype(str)
X_test = X_test.astype(str)

#Modeling

## Naive Bayes


```
                              precision    recall  f1-score   support

                Anne Bronte       0.79      0.41      0.54       177
    Bell AKA Bronte Sisters       0.87      0.86      0.86       161
           Charlotte Bronte       0.90      0.69      0.78        13
Edith Rickert & Gleb Botkin       0.81      0.72      0.76        29
               Emily Bronte       0.90      0.78      0.84       426
              Ethel M. Dell       0.91      0.94      0.93      1737
          Fyodor Dostoevsky       0.86      0.89      0.87      1118
                Jane Austen       0.77      0.92      0.84       529
                    Various       0.93      0.85      0.89       435

                   accuracy                           0.88      4625
                  macro avg       0.86      0.78      0.81      4625
               weighted avg       0.88      0.88      0.87      4625

```

In [29]:

pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('nb', ComplementNB())])
params = {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000)
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))

18/02/2021 23:29:06, started grid search
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.869, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.868, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.6s remaining:    0.0s


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.867, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.873, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.872, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.852, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.854, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.850, total=   0.8s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   48.3s finished


18/02/2021 23:29:55, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.79      0.41      0.54       177
    Bell AKA Bronte Sisters       0.87      0.86      0.86       161
           Charlotte Bronte       0.90      0.69      0.78        13
Edith Rickert & Gleb Botkin       0.81      0.72      0.76        29
               Emily Bronte       0.90      0.78      0.84       426
              Ethel M. Dell       0.91      0.94      0.93      1737
          Fyodor Dostoevsky       0.86      0.89      0.87      1118
                Jane Austen       0.77      0.92      0.84       529
                    Various       0.93      0.85      0.89       435

                   accuracy                           0.88      4625
                  macro avg       0.86      0.78      0.81      4625
               weighted avg       0.88      0.88      0.87      4625



In [30]:
search.best_estimator_

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=0.75,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('nb',
                 ComplementNB(alpha=1.0, class_prior=None, fit_prior=True,
                              norm=False))],
         verbose=False)

## Logistic Regression

```

                             precision    recall  f1-score   support

                Anne Bronte       0.76      0.54      0.63       177
    Bell AKA Bronte Sisters       0.83      0.80      0.82       161
           Charlotte Bronte       0.88      0.54      0.67        13
Edith Rickert & Gleb Botkin       0.85      0.59      0.69        29
               Emily Bronte       0.88      0.75      0.81       426
              Ethel M. Dell       0.88      0.93      0.90      1737
          Fyodor Dostoevsky       0.78      0.88      0.83      1118
                Jane Austen       0.90      0.82      0.86       529
                    Various       0.90      0.79      0.84       435

                   accuracy                           0.85      4625
                  macro avg       0.85      0.74      0.78      4625
               weighted avg       0.85      0.85      0.85      4625


```



In [12]:
from sklearn.linear_model import LogisticRegression

In [15]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('lr', LogisticRegression())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          "lr__max_iter": [n for n in range(450, 600, 50)]
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))

19/02/2021 20:48:30, started grid search
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.843, total=  17.4s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.4s remaining:    0.0s


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.837, total=  16.4s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   33.8s remaining:    0.0s


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.840, total=  15.8s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.835, total=  23.6s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.844, total=  17.1s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.841, total=  11.0s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.837, total=  10.6s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, s

[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 51.9min finished


19/02/2021 21:40:50, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.76      0.54      0.63       177
    Bell AKA Bronte Sisters       0.83      0.80      0.82       161
           Charlotte Bronte       0.88      0.54      0.67        13
Edith Rickert & Gleb Botkin       0.85      0.59      0.69        29
               Emily Bronte       0.88      0.75      0.81       426
              Ethel M. Dell       0.88      0.93      0.90      1737
          Fyodor Dostoevsky       0.78      0.88      0.83      1118
                Jane Austen       0.90      0.82      0.86       529
                    Various       0.90      0.79      0.84       435

                   accuracy                           0.85      4625
                  macro avg       0.85      0.74      0.78      4625
               weighted avg       0.85      0.85      0.85      4625



In [16]:
bow_model

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=0.75,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('lr',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=450,
                                    multi_class='auto', n_jobs=None,
                    

## Decision Tree

```

                             precision    recall  f1-score   support

                Anne Bronte       0.31      0.21      0.25       177
    Bell AKA Bronte Sisters       0.69      0.59      0.64       161
           Charlotte Bronte       0.56      0.38      0.45        13
Edith Rickert & Gleb Botkin       0.62      0.52      0.57        29
               Emily Bronte       0.69      0.53      0.60       426
              Ethel M. Dell       0.79      0.83      0.81      1737
          Fyodor Dostoevsky       0.60      0.75      0.67      1118
                Jane Austen       0.70      0.63      0.66       529
                    Various       0.75      0.54      0.63       435

                   accuracy                           0.70      4625
                  macro avg       0.63      0.55      0.58      4625
               weighted avg       0.70      0.70      0.69      4625

```




In [17]:
from sklearn.tree import DecisionTreeClassifier

In [18]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('dt', DecisionTreeClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          'dt__max_depth': [n for n in range(25, 500, 50)] 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

19/02/2021 21:44:02, started grid search
Fitting 5 folds for each of 120 candidates, totalling 600 fits
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.641, total=   1.8s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.8s remaining:    0.0s


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.632, total=   1.7s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.5s remaining:    0.0s


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.631, total=   1.8s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.635, total=   1.7s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.630, total=   1.7s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.639, total=   1.5s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.635, total=   1.4s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, s

[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed: 46.3min finished


19/02/2021 22:30:26, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.31      0.21      0.25       177
    Bell AKA Bronte Sisters       0.69      0.59      0.64       161
           Charlotte Bronte       0.56      0.38      0.45        13
Edith Rickert & Gleb Botkin       0.62      0.52      0.57        29
               Emily Bronte       0.69      0.53      0.60       426
              Ethel M. Dell       0.79      0.83      0.81      1737
          Fyodor Dostoevsky       0.60      0.75      0.67      1118
                Jane Austen       0.70      0.63      0.66       529
                    Various       0.75      0.54      0.63       435

                   accuracy                           0.70      4625
                  macro avg       0.63      0.55      0.58      4625
               weighted avg       0.70      0.70      0.69      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##Random Forest



```

                             precision    recall  f1-score   support

                Anne Bronte       0.67      0.07      0.12       177
    Bell AKA Bronte Sisters       0.99      0.51      0.67       161
           Charlotte Bronte       1.00      0.23      0.38        13
Edith Rickert & Gleb Botkin       0.86      0.62      0.72        29
               Emily Bronte       0.93      0.53      0.67       426
              Ethel M. Dell       0.76      0.94      0.84      1737
          Fyodor Dostoevsky       0.68      0.84      0.75      1118
                Jane Austen       0.87      0.70      0.77       529
                    Various       0.90      0.63      0.74       435

                   accuracy                           0.77      4625
                  macro avg       0.85      0.56      0.63      4625
               weighted avg       0.79      0.77      0.75      4625
               
```



In [10]:
from sklearn.ensemble import RandomForestClassifier

In [11]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('rf', RandomForestClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'rf__max_depth':  [n for n in range(100, 20000, 9900)],
          'rf__n_estimators':  [n for n in range(100, 500, 100)],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

20/02/2021 23:46:37, started grid search
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.744, total=  18.1s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   18.1s remaining:    0.0s


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.746, total=  18.1s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   36.3s remaining:    0.0s


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.740, total=  18.7s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.736, total=  18.7s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.753, total=  18.2s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.739, total=  19.9s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.746, total=  19.1s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.732, total=  19.8s
[CV] rf__max_depth=100, rf__n_estimators=100, vectoriz

[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 227.3min finished


21/02/2021 03:36:21, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.67      0.07      0.12       177
    Bell AKA Bronte Sisters       0.99      0.51      0.67       161
           Charlotte Bronte       1.00      0.23      0.38        13
Edith Rickert & Gleb Botkin       0.86      0.62      0.72        29
               Emily Bronte       0.93      0.53      0.67       426
              Ethel M. Dell       0.76      0.94      0.84      1737
          Fyodor Dostoevsky       0.68      0.84      0.75      1118
                Jane Austen       0.87      0.70      0.77       529
                    Various       0.90      0.63      0.74       435

                   accuracy                           0.77      4625
                  macro avg       0.85      0.56      0.63      4625
               weighted avg       0.79      0.77      0.75      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##KNN

```

                             precision    recall  f1-score   support

                Anne Bronte       0.27      0.02      0.04       177
    Bell AKA Bronte Sisters       0.22      0.01      0.02       161
           Charlotte Bronte       0.06      0.38      0.10        13
Edith Rickert & Gleb Botkin       0.31      0.34      0.33        29
               Emily Bronte       0.42      0.11      0.18       426
              Ethel M. Dell       0.58      0.68      0.62      1737
          Fyodor Dostoevsky       0.38      0.67      0.49      1118
                Jane Austen       0.71      0.11      0.18       529
                    Various       0.49      0.35      0.41       435

                   accuracy                           0.48      4625
                  macro avg       0.38      0.30      0.26      4625
               weighted avg       0.50      0.48      0.43      4625

```



In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [19]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('knn', KNeighborsClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          'knn__n_neighbors': [39, 17],
          'knn__leaf_size': [30, 12],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

21/02/2021 04:58:14, started grid search
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.421, total=   3.1s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.1s remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.434, total=   3.5s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.6s remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.437, total=   3.3s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.437, total=   4.4s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.438, total=   5.1s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.418, total=   5.1s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_fe

[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 13.9min finished


21/02/2021 05:12:08, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.27      0.02      0.04       177
    Bell AKA Bronte Sisters       0.22      0.01      0.02       161
           Charlotte Bronte       0.06      0.38      0.10        13
Edith Rickert & Gleb Botkin       0.31      0.34      0.33        29
               Emily Bronte       0.42      0.11      0.18       426
              Ethel M. Dell       0.58      0.68      0.62      1737
          Fyodor Dostoevsky       0.38      0.67      0.49      1118
                Jane Austen       0.71      0.11      0.18       529
                    Various       0.49      0.35      0.41       435

                   accuracy                           0.48      4625
                  macro avg       0.38      0.30      0.26      4625
               weighted avg       0.50      0.48      0.43      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

## SVM



```
                              precision    recall  f1-score   support

                Anne Bronte       0.45      0.51      0.48       177
    Bell AKA Bronte Sisters       0.74      0.81      0.77       161
           Charlotte Bronte       0.78      0.54      0.64        13
Edith Rickert & Gleb Botkin       0.65      0.59      0.62        29
               Emily Bronte       0.76      0.72      0.74       426
              Ethel M. Dell       0.86      0.88      0.87      1737
          Fyodor Dostoevsky       0.75      0.82      0.79      1118
                Jane Austen       0.86      0.76      0.81       529
                    Various       0.86      0.69      0.77       435

                   accuracy                           0.80      4625
                  macro avg       0.75      0.70      0.72      4625
               weighted avg       0.81      0.80      0.80      4625

```



In [8]:
from sklearn.svm import SVC

In [9]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('sv', SVC())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'sv__kernel': ['linear', 'rbf'],
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

21/02/2021 12:53:54, started grid search
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.799, total=  43.9s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   43.9s remaining:    0.0s


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.794, total=  43.7s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.5min remaining:    0.0s


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.795, total=  44.1s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.783, total=  44.2s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.794, total=  45.7s
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.799, total=  47.1s
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.796, total=  47.9s
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.793, total=  47.0s
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.787, total=  48.4s
[CV] sv__kernel=linear, vectorizer__ma

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 37.7min finished


21/02/2021 13:32:32, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.45      0.51      0.48       177
    Bell AKA Bronte Sisters       0.74      0.81      0.77       161
           Charlotte Bronte       0.78      0.54      0.64        13
Edith Rickert & Gleb Botkin       0.65      0.59      0.62        29
               Emily Bronte       0.76      0.72      0.74       426
              Ethel M. Dell       0.86      0.88      0.87      1737
          Fyodor Dostoevsky       0.75      0.82      0.79      1118
                Jane Austen       0.86      0.76      0.81       529
                    Various       0.86      0.69      0.77       435

                   accuracy                           0.80      4625
                  macro avg       0.75      0.70      0.72      4625
               weighted avg       0.81      0.80      0.80      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##Gradient Boosting

```

                             precision    recall  f1-score   support

                Anne Bronte       0.63      0.44      0.51       177
    Bell AKA Bronte Sisters       0.89      0.60      0.71       161
           Charlotte Bronte       0.58      0.54      0.56        13
Edith Rickert & Gleb Botkin       0.71      0.69      0.70        29
               Emily Bronte       0.89      0.69      0.77       426
              Ethel M. Dell       0.83      0.91      0.87      1737
          Fyodor Dostoevsky       0.71      0.83      0.77      1118
                Jane Austen       0.89      0.77      0.82       529
                    Various       0.85      0.72      0.78       435

                   accuracy                           0.80      4625
                  macro avg       0.78      0.69      0.72      4625
               weighted avg       0.81      0.80      0.80      4625

```



In [9]:
from sklearn.ensemble import GradientBoostingClassifier

In [10]:
pipeline = Pipeline([('vectorizer', CountVectorizer(analyzer='word')), ('gb', GradientBoostingClassifier())])
params =  {
          'gb__max_depth': [n for n in range(300, 500, 100)], 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

21/02/2021 22:33:47, started grid search
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] gb__max_depth=300 ...............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................... gb__max_depth=300, score=0.804, total=28.5min
[CV] gb__max_depth=300 ...............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 28.5min remaining:    0.0s


[CV] ................... gb__max_depth=300, score=0.802, total=27.9min
[CV] gb__max_depth=300 ...............................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 56.4min remaining:    0.0s


[CV] ................... gb__max_depth=300, score=0.795, total=27.1min
[CV] gb__max_depth=300 ...............................................
[CV] ................... gb__max_depth=300, score=0.803, total=28.0min
[CV] gb__max_depth=300 ...............................................
[CV] ................... gb__max_depth=300, score=0.798, total=27.0min
[CV] gb__max_depth=400 ...............................................
[CV] ................... gb__max_depth=400, score=0.802, total=29.0min
[CV] gb__max_depth=400 ...............................................
[CV] ................... gb__max_depth=400, score=0.800, total=28.9min
[CV] gb__max_depth=400 ...............................................
[CV] ................... gb__max_depth=400, score=0.794, total=28.6min
[CV] gb__max_depth=400 ...............................................
[CV] ................... gb__max_depth=400, score=0.800, total=29.2min
[CV] gb__max_depth=400 ...............................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 282.3min finished


22/02/2021 03:53:42, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.63      0.44      0.51       177
    Bell AKA Bronte Sisters       0.89      0.60      0.71       161
           Charlotte Bronte       0.58      0.54      0.56        13
Edith Rickert & Gleb Botkin       0.71      0.69      0.70        29
               Emily Bronte       0.89      0.69      0.77       426
              Ethel M. Dell       0.83      0.91      0.87      1737
          Fyodor Dostoevsky       0.71      0.83      0.77      1118
                Jane Austen       0.89      0.77      0.82       529
                    Various       0.85      0.72      0.78       435

                   accuracy                           0.80      4625
                  macro avg       0.78      0.69      0.72      4625
               weighted avg       0.81      0.80      0.80      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

# Conclusion

In this notebook, a selection of Gutenberg works was cleaned and tested against seven models using a bag of words transformation to predict their authors.

The Naive Bayes provides an overall F1 score that tops greater than 80% as accuracy, macro weighted, and a weighted average. This suggests a fair enough prediction. Additionally, I'm seeing an expedient runtime speed with this model. 

Considering the number of documents located inside the Gutenberg website, time is crucial. 

When it comes to bag of words, author classification against the Gutenberg corpus, and supervised learning models - I would say that Naive Bayes is the model for the job. 

---
*a Thinkful Project by Kalika Kay Curry*
