<a href="https://colab.research.google.com/github/KalikaKay/Author-Classification-Project/blob/master/TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF
Using TF-IDF - apply supervised models with GridSearchCV:
* Naive Bayes
* Logistic Regression
* Decision Tree
* Random Forest
* KNN
* SVM 
* Gradient Boosting 

# Data Cleaning



In [3]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from gensim.summarization import keywords
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import datetime as dt
from sklearn.metrics import classification_report
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
import numpy as np

# file location
PATH = '/content/drive/MyDrive/Author Classification/AuthorTexts'
DOC_PATTERN = r'.*\.txt'

corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
#Dataframe of sentences and authors.
for fileid in corpus.fileids():
  book = pd.DataFrame(corpus.raw(fileids=fileid).split('\n'), columns=['sentence'])
  book['author'] = fileid.split('/')[0]
  try:
    books = pd.concat([books, book])
  except NameError: 
    books = book

In [5]:
#Remove empty strings.
books = books.replace(r'^\s*$', np.NaN, regex=True)
books.dropna(inplace=True)
books.reindex()

#Remove contents and Chapter Titles.
i = 0
for sent in books.sentence:
  if any(c.islower() for c in sent) == False:
    books.iloc[i] = np.NaN
  if 'project gutenberg' in sent.lower():
    books.iloc[i] = np.NaN    
  if 'contents' == sent.lower().strip():
    books.iloc[i] == np.NaN
  i += 1

books.dropna(inplace=True)
books.reindex()
#drop the content headers.
content_index = books[books.sentence == 'Contents '].index.values
books.drop(content_index, inplace=True)

# Feature Engineering

These features will be fed into a pipeline for our models. 

Due to memory constraints, I've opted to visualize the vectors in separate notebooks. 

In [6]:

#Tokenize the data
tokenizer = RegexpTokenizer(r'\w+')
books['tokenized'] = [tokenizer.tokenize(sent.lower()) for sent in books.sentence 
                   if tokenizer.tokenize(sent.lower()) not in stopwords.words('english')]

In [7]:
# Lemmatize the tokens. 
lemmatizer = WordNetLemmatizer()
lemmatized = []
for token in books.tokenized:
  lemmatized.append([lemmatizer.lemmatize(word) for word in token])
books['lemmatized'] = lemmatized

#Declare the variables. 
X = books['lemmatized']
y = books['author']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train.astype(str)
X_test = X_test.astype(str)

#Modeling

## Naive Bayes

```
                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.36      0.50       177
    Bell AKA Bronte Sisters       0.84      0.89      0.86       161
           Charlotte Bronte       0.90      0.69      0.78        13
Edith Rickert & Gleb Botkin       0.76      0.76      0.76        29
               Emily Bronte       0.89      0.77      0.83       426
              Ethel M. Dell       0.88      0.95      0.91      1737
          Fyodor Dostoevsky       0.88      0.86      0.87      1118
                Jane Austen       0.78      0.91      0.84       529
                    Various       0.92      0.85      0.88       435

                   accuracy                           0.87      4625
                  macro avg       0.85      0.78      0.80      4625
               weighted avg       0.87      0.87      0.86      4625


```



In [None]:

pipeline = Pipeline([('vectorizer', TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('nb', ComplementNB())])
params = {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000)
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

22/02/2021 16:46:06, started grid search
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.866, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.864, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.4s remaining:    0.0s


[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.860, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.864, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=None ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.864, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.858, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.852, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.847, total=   0.7s
[CV] vectorizer__max_df=0.5, vectorizer__max_features=5000 ...........
[CV]  vectorizer__max_df=0.5, vectorizer__max_

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   41.6s finished


22/02/2021 16:46:49, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.36      0.50       177
    Bell AKA Bronte Sisters       0.84      0.89      0.86       161
           Charlotte Bronte       0.90      0.69      0.78        13
Edith Rickert & Gleb Botkin       0.76      0.76      0.76        29
               Emily Bronte       0.89      0.77      0.83       426
              Ethel M. Dell       0.88      0.95      0.91      1737
          Fyodor Dostoevsky       0.88      0.86      0.87      1118
                Jane Austen       0.78      0.91      0.84       529
                    Various       0.92      0.85      0.88       435

                   accuracy                           0.87      4625
                  macro avg       0.85      0.78      0.80      4625
               weighted avg       0.87      0.87      0.86      4625



## Logistic Regression

```
                             precision    recall  f1-score   support

                Anne Bronte       0.91      0.49      0.63       177
    Bell AKA Bronte Sisters       0.89      0.77      0.83       161
           Charlotte Bronte       1.00      0.08      0.14        13
Edith Rickert & Gleb Botkin       0.90      0.66      0.76        29
               Emily Bronte       0.93      0.69      0.79       426
              Ethel M. Dell       0.85      0.95      0.89      1737
          Fyodor Dostoevsky       0.79      0.89      0.84      1118
                Jane Austen       0.90      0.80      0.85       529
                    Various       0.89      0.76      0.82       435

                   accuracy                           0.85      4625
                  macro avg       0.90      0.68      0.73      4625
               weighted avg       0.85      0.85      0.84      4625

```



In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
pipeline = Pipeline([('vectorizer', TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('lr', LogisticRegression())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          "lr__max_iter": [n for n in range(450, 600, 50)]
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

22/02/2021 16:50:25, started grid search
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.834, total=  11.4s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   11.4s remaining:    0.0s


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.824, total=  10.8s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   22.2s remaining:    0.0s


[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.837, total=  11.3s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.833, total=  10.5s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.833, total=  10.6s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.833, total=   9.3s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.828, total=   9.0s
[CV] lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  lr__max_iter=450, vectorizer__max_df=0.5, vectorizer__max_features=5000, s

[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 32.0min finished


22/02/2021 17:22:38, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.91      0.49      0.63       177
    Bell AKA Bronte Sisters       0.89      0.77      0.83       161
           Charlotte Bronte       1.00      0.08      0.14        13
Edith Rickert & Gleb Botkin       0.90      0.66      0.76        29
               Emily Bronte       0.93      0.69      0.79       426
              Ethel M. Dell       0.85      0.95      0.89      1737
          Fyodor Dostoevsky       0.79      0.89      0.84      1118
                Jane Austen       0.90      0.80      0.85       529
                    Various       0.89      0.76      0.82       435

                   accuracy                           0.85      4625
                  macro avg       0.90      0.68      0.73      4625
               weighted avg       0.85      0.85      0.84      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

## Decision Tree

```

                             precision    recall  f1-score   support

                Anne Bronte       0.30      0.21      0.25       177
    Bell AKA Bronte Sisters       0.57      0.55      0.56       161
           Charlotte Bronte       0.62      0.38      0.48        13
Edith Rickert & Gleb Botkin       0.45      0.48      0.47        29
               Emily Bronte       0.62      0.52      0.56       426
              Ethel M. Dell       0.76      0.80      0.78      1737
          Fyodor Dostoevsky       0.62      0.69      0.65      1118
                Jane Austen       0.66      0.62      0.64       529
                    Various       0.64      0.57      0.60       435

                   accuracy                           0.67      4625
                  macro avg       0.58      0.54      0.56      4625
               weighted avg       0.67      0.67      0.67      4625
   
```




In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
pipeline = Pipeline([('vectorizer',TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('dt', DecisionTreeClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          'dt__max_depth': [n for n in range(25, 500, 50)] 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

22/02/2021 17:29:32, started grid search
Fitting 5 folds for each of 120 candidates, totalling 600 fits
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.633, total=   2.2s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.2s remaining:    0.0s


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.626, total=   2.2s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.4s remaining:    0.0s


[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.634, total=   2.2s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.631, total=   2.2s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.623, total=   2.1s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.638, total=   1.9s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.632, total=   2.0s
[CV] dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  dt__max_depth=25, vectorizer__max_df=0.5, vectorizer__max_features=5000, s

[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed: 51.3min finished


22/02/2021 18:20:59, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.30      0.21      0.25       177
    Bell AKA Bronte Sisters       0.57      0.55      0.56       161
           Charlotte Bronte       0.62      0.38      0.48        13
Edith Rickert & Gleb Botkin       0.45      0.48      0.47        29
               Emily Bronte       0.62      0.52      0.56       426
              Ethel M. Dell       0.76      0.80      0.78      1737
          Fyodor Dostoevsky       0.62      0.69      0.65      1118
                Jane Austen       0.66      0.62      0.64       529
                    Various       0.64      0.57      0.60       435

                   accuracy                           0.67      4625
                  macro avg       0.58      0.54      0.56      4625
               weighted avg       0.67      0.67      0.67      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##Random Forest

```

                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.08      0.14       177
    Bell AKA Bronte Sisters       0.95      0.59      0.73       161
           Charlotte Bronte       1.00      0.38      0.56        13
Edith Rickert & Gleb Botkin       0.78      0.48      0.60        29
               Emily Bronte       0.98      0.53      0.69       426
              Ethel M. Dell       0.76      0.94      0.84      1737
          Fyodor Dostoevsky       0.67      0.84      0.74      1118
                Jane Austen       0.89      0.68      0.77       529
                    Various       0.92      0.65      0.76       435

                   accuracy                           0.77      4625
                  macro avg       0.86      0.57      0.65      4625
               weighted avg       0.80      0.77      0.76      4625
               
```



In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
pipeline = Pipeline([('vectorizer', TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('rf', RandomForestClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'rf__max_depth':  [n for n in range(100, 20000, 9900)],
          'rf__n_estimators':  [n for n in range(100, 500, 100)],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

22/02/2021 21:02:53, started grid search
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.741, total=  17.3s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.4s remaining:    0.0s


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.744, total=  17.1s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   34.5s remaining:    0.0s


[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.742, total=  17.2s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.738, total=  16.9s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5 .
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.5, score=0.753, total=  17.0s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.742, total=  18.2s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.738, total=  18.8s
[CV] rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75 
[CV]  rf__max_depth=100, rf__n_estimators=100, vectorizer__max_df=0.75, score=0.742, total=  19.0s
[CV] rf__max_depth=100, rf__n_estimators=100, vectoriz

[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 186.2min finished


23/02/2021 00:11:00, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.08      0.14       177
    Bell AKA Bronte Sisters       0.95      0.59      0.73       161
           Charlotte Bronte       1.00      0.38      0.56        13
Edith Rickert & Gleb Botkin       0.78      0.48      0.60        29
               Emily Bronte       0.98      0.53      0.69       426
              Ethel M. Dell       0.76      0.94      0.84      1737
          Fyodor Dostoevsky       0.67      0.84      0.74      1118
                Jane Austen       0.89      0.68      0.77       529
                    Various       0.92      0.65      0.76       435

                   accuracy                           0.77      4625
                  macro avg       0.86      0.57      0.65      4625
               weighted avg       0.80      0.77      0.76      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##KNN

```

                             precision    recall  f1-score   support

                Anne Bronte       1.00      0.11      0.20       177
    Bell AKA Bronte Sisters       0.94      0.29      0.44       161
           Charlotte Bronte       1.00      0.31      0.47        13
Edith Rickert & Gleb Botkin       0.79      0.66      0.72        29
               Emily Bronte       0.91      0.17      0.28       426
              Ethel M. Dell       0.80      0.85      0.82      1737
          Fyodor Dostoevsky       0.52      0.92      0.67      1118
                Jane Austen       0.88      0.58      0.70       529
                    Various       0.97      0.63      0.76       435

                   accuracy                           0.70      4625
                  macro avg       0.87      0.50      0.56      4625
               weighted avg       0.78      0.70      0.68      4625


```



In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [19]:
pipeline = Pipeline([('vectorizer', TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('knn', KNeighborsClassifier())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'vectorizer__max_features': (None, 5000, 10000, 50000),
          'knn__n_neighbors': [39, 17],
          'knn__leaf_size': [30, 12],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

23/02/2021 01:23:58, started grid search
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.693, total=   3.3s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.3s remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.691, total=   2.9s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.2s remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.705, total=   2.9s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.715, total=   2.8s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=None, score=0.713, total=   2.8s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000, score=0.345, total=   2.9s
[CV] knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_features=5000 
[CV]  knn__leaf_size=30, knn__n_neighbors=39, vectorizer__max_df=0.5, vectorizer__max_fe

[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 12.0min finished


23/02/2021 01:36:01, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       1.00      0.11      0.20       177
    Bell AKA Bronte Sisters       0.94      0.29      0.44       161
           Charlotte Bronte       1.00      0.31      0.47        13
Edith Rickert & Gleb Botkin       0.79      0.66      0.72        29
               Emily Bronte       0.91      0.17      0.28       426
              Ethel M. Dell       0.80      0.85      0.82      1737
          Fyodor Dostoevsky       0.52      0.92      0.67      1118
                Jane Austen       0.88      0.58      0.70       529
                    Various       0.97      0.63      0.76       435

                   accuracy                           0.70      4625
                  macro avg       0.87      0.50      0.56      4625
               weighted avg       0.78      0.70      0.68      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

## SVM



```

                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.53      0.65       177
    Bell AKA Bronte Sisters       0.89      0.84      0.87       161
           Charlotte Bronte       0.88      0.54      0.67        13
Edith Rickert & Gleb Botkin       0.83      0.69      0.75        29
               Emily Bronte       0.92      0.71      0.80       426
              Ethel M. Dell       0.86      0.94      0.90      1737
          Fyodor Dostoevsky       0.79      0.89      0.84      1118
                Jane Austen       0.91      0.83      0.87       529
                    Various       0.95      0.79      0.86       435

                   accuracy                           0.86      4625
                  macro avg       0.87      0.75      0.80      4625
               weighted avg       0.86      0.86      0.86      4625

```



In [20]:
from sklearn.svm import SVC

In [21]:
pipeline = Pipeline([('vectorizer',TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('sv', SVC())])
params =  {
          'vectorizer__max_df': (0.5, 0.75, 1.0), 
          'sv__kernel': ['linear', 'rbf'],
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

23/02/2021 01:44:35, started grid search
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.850, total=  57.2s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   57.2s remaining:    0.0s


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.841, total=  57.2s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.851, total=  58.3s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.848, total=  58.5s
[CV] sv__kernel=linear, vectorizer__max_df=0.5 .......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.5, score=0.847, total=  57.9s
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.848, total= 1.0min
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.846, total= 1.0min
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.850, total= 1.0min
[CV] sv__kernel=linear, vectorizer__max_df=0.75 ......................
[CV]  sv__kernel=linear, vectorizer__max_df=0.75, score=0.852, total= 1.0min
[CV] sv__kernel=linear, vectorizer__ma

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 44.6min finished


23/02/2021 02:30:28, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.82      0.53      0.65       177
    Bell AKA Bronte Sisters       0.89      0.84      0.87       161
           Charlotte Bronte       0.88      0.54      0.67        13
Edith Rickert & Gleb Botkin       0.83      0.69      0.75        29
               Emily Bronte       0.92      0.71      0.80       426
              Ethel M. Dell       0.86      0.94      0.90      1737
          Fyodor Dostoevsky       0.79      0.89      0.84      1118
                Jane Austen       0.91      0.83      0.87       529
                    Various       0.95      0.79      0.86       435

                   accuracy                           0.86      4625
                  macro avg       0.87      0.75      0.80      4625
               weighted avg       0.86      0.86      0.86      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

##Gradient Boosting

```

                             precision    recall  f1-score   support

                Anne Bronte       0.71      0.46      0.56       177
    Bell AKA Bronte Sisters       0.78      0.65      0.71       161
           Charlotte Bronte       0.70      0.54      0.61        13
Edith Rickert & Gleb Botkin       0.73      0.55      0.63        29
               Emily Bronte       0.88      0.67      0.76       426
              Ethel M. Dell       0.80      0.92      0.86      1737
          Fyodor Dostoevsky       0.75      0.81      0.78      1118
                Jane Austen       0.88      0.75      0.81       529
                    Various       0.90      0.77      0.83       435

                   accuracy                           0.81      4625
                  macro avg       0.79      0.68      0.73      4625
               weighted avg       0.81      0.81      0.80      4625


```
A fun note on gridsearchcv's best estimator. The difference in accuracy between a max depth of ten and the max depth of 500 seemed insignificant. The speed however - is a difference between an average fit time of eight minutes and seventy minutes. The seventy minute model was returned by the utility. 



In [9]:
from sklearn.ensemble import GradientBoostingClassifier

In [10]:
pipeline = Pipeline([('vectorizer', TfidfVectorizer(min_df=2, use_idf=True, norm=u'l2', smooth_idf=True)), ('gb', GradientBoostingClassifier())])
params =  {
          'gb__max_depth': [10, 500], 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

23/02/2021 11:58:40, started grid search
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] gb__max_depth=10 ................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .................... gb__max_depth=10, score=0.798, total= 7.2min
[CV] gb__max_depth=10 ................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.2min remaining:    0.0s


[CV] .................... gb__max_depth=10, score=0.796, total= 7.3min
[CV] gb__max_depth=10 ................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 14.5min remaining:    0.0s


[CV] .................... gb__max_depth=10, score=0.788, total= 7.4min
[CV] gb__max_depth=10 ................................................
[CV] .................... gb__max_depth=10, score=0.799, total= 7.6min
[CV] gb__max_depth=10 ................................................
[CV] .................... gb__max_depth=10, score=0.793, total= 7.1min
[CV] gb__max_depth=500 ...............................................
[CV] ................... gb__max_depth=500, score=0.804, total=67.4min
[CV] gb__max_depth=500 ...............................................
[CV] ................... gb__max_depth=500, score=0.791, total=67.5min
[CV] gb__max_depth=500 ...............................................
[CV] ................... gb__max_depth=500, score=0.789, total=69.3min
[CV] gb__max_depth=500 ...............................................
[CV] ................... gb__max_depth=500, score=0.797, total=67.3min
[CV] gb__max_depth=500 ...............................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 377.5min finished


23/02/2021 19:52:28, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.71      0.46      0.56       177
    Bell AKA Bronte Sisters       0.78      0.65      0.71       161
           Charlotte Bronte       0.70      0.54      0.61        13
Edith Rickert & Gleb Botkin       0.73      0.55      0.63        29
               Emily Bronte       0.88      0.67      0.76       426
              Ethel M. Dell       0.80      0.92      0.86      1737
          Fyodor Dostoevsky       0.75      0.81      0.78      1118
                Jane Austen       0.88      0.75      0.81       529
                    Various       0.90      0.77      0.83       435

                   accuracy                           0.81      4625
                  macro avg       0.79      0.68      0.73      4625
               weighted avg       0.81      0.81      0.80      4625

Pipeline(memory=None,
         steps=[('vectorizer',
     

# Conclusion

In this notebook, a selection of Gutenberg works was cleaned and tested against seven models using a TF-IDF transformation to predict their authors.

The Naive Bayes provides an overall F1 score that tops greater than 75% as accuracy, macro weighted, and a weighted average. This suggests a fair enough prediction. Additionally, I'm seeing an expedient runtime speed with this model. 

Considering the number of documents located inside the Gutenberg website, time is crucial. 

When it comes to TF-IDF, author classification against the Gutenberg corpus, and supervised learning models - I would say that Naive Bayes is the model for the job. 

---
*a Thinkful Project by Kalika Kay Curry*
