<a href="https://colab.research.google.com/github/KalikaKay/Author-Classification-Project/blob/master/Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec
Using Word2Vec - apply supervised models with GridSearchCV:
* Naive Bayes
* Logistic Regression
* Decision Tree
* Random Forest
* KNN
* SVM 
* Gradient Boosting 


I had a tough decision between using Doc2Vec and Word2Vec. I was getting terrible results on Naive Bayes with both models; so I went back to the original modeling requests. 

The original assignment requested that I utilize Word2Vec. The Word2Vec modeling is excessivel long - while Doc2Vec is much faster. Doc2Vec required that I analyze based on sentences in order to acheive a Naive Bayes score > .30.

# Data Cleaning



In [12]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from gensim.summarization import keywords
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import datetime as dt
from sklearn.metrics import classification_report
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
import numpy as np

# file location
PATH = '/content/drive/MyDrive/Author Classification/AuthorTexts'
DOC_PATTERN = r'.*\.txt'

corpus = PlaintextCorpusReader(PATH, DOC_PATTERN)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [13]:
#Dataframe of sentences and authors.
for fileid in corpus.fileids():
  book = pd.DataFrame(corpus.raw(fileids=fileid).split('\n'), columns=['sentence'])
  book['author'] = fileid.split('/')[0]
  try:
    books = pd.concat([books, book])
  except NameError: 
    books = book

In [14]:
#Remove empty strings.
books = books.replace(r'^\s*$', np.NaN, regex=True)
books.dropna(inplace=True)
books.reindex()

#Remove contents and Chapter Titles.
i = 0
for sent in books.sentence:
  if any(c.islower() for c in sent) == False:
    books.iloc[i] = np.NaN
  if 'project gutenberg' in sent.lower():
    books.iloc[i] = np.NaN    
  if 'contents' == sent.lower().strip():
    books.iloc[i] == np.NaN
  i += 1

books.dropna(inplace=True)
books.reindex()
#drop the content headers.
content_index = books[books.sentence == 'Contents '].index.values
books.drop(content_index, inplace=True)

# Feature Engineering

These features will be fed into a pipeline for our models. 

Due to memory constraints, I've opted to visualize the vectors in separate notebooks. 

In [15]:
#Tokenize the data
tokenizer = RegexpTokenizer(r'\w+')
books['tokenized'] = [tokenizer.tokenize(sent.lower()) for sent in books.sentence 
                   if tokenizer.tokenize(sent.lower()) not in stopwords.words('english')]

# Lemmatize the tokens. 
lemmatizer = WordNetLemmatizer()
lemmatized = []
for token in books.tokenized:
  lemmatized.append([lemmatizer.lemmatize(word) for word in token])
books['lemmatized'] = lemmatized

In [16]:
y = books['author']

In [17]:
vector_size = max(len(x) for x in books['lemmatized'] ) 
model =  Word2Vec(books['lemmatized'], min_count = 1, seed = 1, size=vector_size)
word2vec_arr = np.zeros((books.shape[0], model.vector_size))

for i, sentence in enumerate(books['lemmatized']):
    word2vec_arr[i,:] = np.mean([model[word] for word in sentence], axis=0)
X = word2vec_arr

  


In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = X_train
X_test = X_test

#Modeling

## Naive Bayes

```
                             precision    recall  f1-score   support

                Anne Bronte       0.00      0.00      0.00       177
    Bell AKA Bronte Sisters       0.00      0.00      0.00       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.25      0.00      0.00       426
              Ethel M. Dell       0.57      0.58      0.58      1737
          Fyodor Dostoevsky       0.36      0.19      0.25      1118
                Jane Austen       0.33      0.47      0.39       529
                    Various       0.21      0.71      0.32       435

                   accuracy                           0.38      4625
                  macro avg       0.19      0.22      0.17      4625
               weighted avg       0.38      0.38      0.35      4625


```



In [45]:

pipeline = Pipeline([('s', MinMaxScaler()), ('nb', ComplementNB())])
params = {
          's__feature_range' : [(0, abs(X).max()), (0, 1), (0, 2*abs(X).max()), (0, 5*abs(X).max())]
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

24/02/2021 02:13:45, started grid search
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] s__feature_range=(0, 1.4256254434585571) ........................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  s__feature_range=(0, 1.4256254434585571), score=0.372, total=   1.2s
[CV] s__feature_range=(0, 1.4256254434585571) ........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s


[CV]  s__feature_range=(0, 1.4256254434585571), score=0.375, total=   1.1s
[CV] s__feature_range=(0, 1.4256254434585571) ........................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.3s remaining:    0.0s


[CV]  s__feature_range=(0, 1.4256254434585571), score=0.387, total=   1.2s
[CV] s__feature_range=(0, 1.4256254434585571) ........................
[CV]  s__feature_range=(0, 1.4256254434585571), score=0.390, total=   1.1s
[CV] s__feature_range=(0, 1.4256254434585571) ........................
[CV]  s__feature_range=(0, 1.4256254434585571), score=0.381, total=   1.1s
[CV] s__feature_range=(0, 1) .........................................
[CV] ............. s__feature_range=(0, 1), score=0.372, total=   1.1s
[CV] s__feature_range=(0, 1) .........................................
[CV] ............. s__feature_range=(0, 1), score=0.375, total=   1.1s
[CV] s__feature_range=(0, 1) .........................................
[CV] ............. s__feature_range=(0, 1), score=0.387, total=   1.1s
[CV] s__feature_range=(0, 1) .........................................
[CV] ............. s__feature_range=(0, 1), score=0.390, total=   1.1s
[CV] s__feature_range=(0, 1) ....................................

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   23.0s finished


24/02/2021 02:14:09, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.00      0.00      0.00       177
    Bell AKA Bronte Sisters       0.00      0.00      0.00       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.00      0.00      0.00       426
              Ethel M. Dell       0.61      0.50      0.54      1737
          Fyodor Dostoevsky       0.31      0.35      0.33      1118
                Jane Austen       0.30      0.41      0.35       529
                    Various       0.22      0.62      0.32       435

                   accuracy                           0.37      4625
                  macro avg       0.16      0.21      0.17      4625
               weighted avg       0.36      0.37      0.35      4625

Pipeline(memory=None,
         steps=[('s',
              

  _warn_prf(average, modifier, msg_start, len(result))


## Logistic Regression



```

                             precision    recall  f1-score   support

                Anne Bronte       0.29      0.01      0.02       177
    Bell AKA Bronte Sisters       0.48      0.34      0.40       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.21      0.04      0.06       426
              Ethel M. Dell       0.60      0.80      0.68      1737
          Fyodor Dostoevsky       0.49      0.52      0.50      1118
                Jane Austen       0.52      0.53      0.52       529
                    Various       0.46      0.40      0.43       435

                   accuracy                           0.54      4625
                  macro avg       0.34      0.29      0.29      4625
               weighted avg       0.49      0.54      0.50      4625

```



Originally ran logistic regression with a MinMax scaler. When the regressor failed to converge, worked with the MaxAbsScaler. 

Logistic regression still failed to converge. 

Attempts to increase the max iter did not resolve the issue and resulted in a long enough runtime such that the colab notebook runtime failed to complete the execution.

A grid search with various solvers are shown below. 


In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
from sklearn.preprocessing import MaxAbsScaler
pipeline = Pipeline([('s', MaxAbsScaler()), ('lr', LogisticRegression())])
params =  {
          "lr__solver": ['sag', 'newton-cg', 'lbfgs', 'liblinear', 'saga']
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

24/02/2021 12:50:14, started grid search
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] lr__solver=sag ..................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.2min remaining:    0.0s


[CV] ...................... lr__solver=sag, score=0.517, total= 2.2min
[CV] lr__solver=sag ..................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.3min remaining:    0.0s


[CV] ...................... lr__solver=sag, score=0.529, total= 2.2min
[CV] lr__solver=sag ..................................................




[CV] ...................... lr__solver=sag, score=0.521, total= 2.2min
[CV] lr__solver=sag ..................................................




[CV] ...................... lr__solver=sag, score=0.530, total= 2.2min
[CV] lr__solver=sag ..................................................




[CV] ...................... lr__solver=sag, score=0.528, total= 2.2min
[CV] lr__solver=newton-cg ............................................
[CV] ................ lr__solver=newton-cg, score=0.527, total= 5.0min
[CV] lr__solver=newton-cg ............................................
[CV] ................ lr__solver=newton-cg, score=0.540, total= 5.4min
[CV] lr__solver=newton-cg ............................................
[CV] ................ lr__solver=newton-cg, score=0.532, total= 4.9min
[CV] lr__solver=newton-cg ............................................
[CV] ................ lr__solver=newton-cg, score=0.539, total= 4.5min
[CV] lr__solver=newton-cg ............................................
[CV] ................ lr__solver=newton-cg, score=0.536, total= 4.6min
[CV] lr__solver=lbfgs ................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] .................... lr__solver=lbfgs, score=0.491, total=  31.0s
[CV] lr__solver=lbfgs ................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] .................... lr__solver=lbfgs, score=0.515, total=  30.7s
[CV] lr__solver=lbfgs ................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] .................... lr__solver=lbfgs, score=0.494, total=  30.0s
[CV] lr__solver=lbfgs ................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] .................... lr__solver=lbfgs, score=0.502, total=  31.0s
[CV] lr__solver=lbfgs ................................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] .................... lr__solver=lbfgs, score=0.506, total=  30.1s
[CV] lr__solver=liblinear ............................................
[CV] ................ lr__solver=liblinear, score=0.518, total= 4.2min
[CV] lr__solver=liblinear ............................................
[CV] ................ lr__solver=liblinear, score=0.530, total= 4.2min
[CV] lr__solver=liblinear ............................................
[CV] ................ lr__solver=liblinear, score=0.523, total= 4.1min
[CV] lr__solver=liblinear ............................................
[CV] ................ lr__solver=liblinear, score=0.523, total= 4.3min
[CV] lr__solver=liblinear ............................................
[CV] ................ lr__solver=liblinear, score=0.527, total= 4.1min
[CV] lr__solver=saga .................................................




[CV] ..................... lr__solver=saga, score=0.515, total= 2.9min
[CV] lr__solver=saga .................................................




[CV] ..................... lr__solver=saga, score=0.525, total= 2.9min
[CV] lr__solver=saga .................................................




[CV] ..................... lr__solver=saga, score=0.512, total= 2.9min
[CV] lr__solver=saga .................................................




[CV] ..................... lr__solver=saga, score=0.523, total= 2.8min
[CV] lr__solver=saga .................................................


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed: 72.9min finished


[CV] ..................... lr__solver=saga, score=0.520, total= 2.8min
24/02/2021 14:09:35, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.29      0.01      0.02       177
    Bell AKA Bronte Sisters       0.48      0.34      0.40       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.21      0.04      0.06       426
              Ethel M. Dell       0.60      0.80      0.68      1737
          Fyodor Dostoevsky       0.49      0.52      0.50      1118
                Jane Austen       0.52      0.53      0.52       529
                    Various       0.46      0.40      0.43       435

                   accuracy                           0.54      4625
                  macro avg       0.34      0.29      0.29      4625
               weighted avg       0.49      0.54      0.5

## Decision Tree

```
                             precision    recall  f1-score   support

                Anne Bronte       0.09      0.09      0.09       177
    Bell AKA Bronte Sisters       0.17      0.16      0.17       161
           Charlotte Bronte       0.15      0.23      0.18        13
Edith Rickert & Gleb Botkin       0.15      0.14      0.14        29
               Emily Bronte       0.16      0.16      0.16       426
              Ethel M. Dell       0.56      0.56      0.56      1737
          Fyodor Dostoevsky       0.34      0.33      0.34      1118
                Jane Austen       0.32      0.32      0.32       529
                    Various       0.30      0.32      0.31       435

                   accuracy                           0.38      4625
                  macro avg       0.25      0.26      0.25      4625
               weighted avg       0.38      0.38      0.38      4625

```




In [11]:
from sklearn.tree import DecisionTreeClassifier

In [12]:
pipeline = Pipeline([('s', MinMaxScaler()), ('dt', DecisionTreeClassifier())])
params =  {
          's__feature_range' : [(0, abs(X).max()), (0, 1), (0, 2*abs(X).max())],
          'dt__max_depth': [n for n in range(25, 500, 50)] 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

24/02/2021 15:03:43, started grid search
Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV] dt__max_depth=25, s__feature_range=(0, 1.8113512992858887) ......


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  dt__max_depth=25, s__feature_range=(0, 1.8113512992858887), score=0.376, total= 1.1min
[CV] dt__max_depth=25, s__feature_range=(0, 1.8113512992858887) ......


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.1min remaining:    0.0s


[CV]  dt__max_depth=25, s__feature_range=(0, 1.8113512992858887), score=0.375, total= 1.2min
[CV] dt__max_depth=25, s__feature_range=(0, 1.8113512992858887) ......


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.3min remaining:    0.0s


[CV]  dt__max_depth=25, s__feature_range=(0, 1.8113512992858887), score=0.392, total= 1.1min
[CV] dt__max_depth=25, s__feature_range=(0, 1.8113512992858887) ......
[CV]  dt__max_depth=25, s__feature_range=(0, 1.8113512992858887), score=0.396, total= 1.2min
[CV] dt__max_depth=25, s__feature_range=(0, 1.8113512992858887) ......
[CV]  dt__max_depth=25, s__feature_range=(0, 1.8113512992858887), score=0.387, total= 1.1min
[CV] dt__max_depth=25, s__feature_range=(0, 1) .......................
[CV]  dt__max_depth=25, s__feature_range=(0, 1), score=0.380, total= 1.1min
[CV] dt__max_depth=25, s__feature_range=(0, 1) .......................
[CV]  dt__max_depth=25, s__feature_range=(0, 1), score=0.386, total= 1.2min
[CV] dt__max_depth=25, s__feature_range=(0, 1) .......................
[CV]  dt__max_depth=25, s__feature_range=(0, 1), score=0.396, total= 1.1min
[CV] dt__max_depth=25, s__feature_range=(0, 1) .......................
[CV]  dt__max_depth=25, s__feature_range=(0, 1), score=0.397, total

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 177.8min finished


24/02/2021 18:02:58, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.09      0.09      0.09       177
    Bell AKA Bronte Sisters       0.17      0.16      0.17       161
           Charlotte Bronte       0.15      0.23      0.18        13
Edith Rickert & Gleb Botkin       0.15      0.14      0.14        29
               Emily Bronte       0.16      0.16      0.16       426
              Ethel M. Dell       0.56      0.56      0.56      1737
          Fyodor Dostoevsky       0.34      0.33      0.34      1118
                Jane Austen       0.32      0.32      0.32       529
                    Various       0.30      0.32      0.31       435

                   accuracy                           0.38      4625
                  macro avg       0.25      0.26      0.25      4625
               weighted avg       0.38      0.38      0.38      4625

Pipeline(memory=None,
         steps=[('s', MinMaxScaler(c

##Random Forest

```

                             precision    recall  f1-score   support

                Anne Bronte       0.60      0.03      0.06       177
    Bell AKA Bronte Sisters       0.52      0.27      0.35       161
           Charlotte Bronte       1.00      0.23      0.38        13
Edith Rickert & Gleb Botkin       0.50      0.10      0.17        29
               Emily Bronte       0.50      0.15      0.23       426
              Ethel M. Dell       0.59      0.82      0.69      1737
          Fyodor Dostoevsky       0.46      0.48      0.47      1118
                Jane Austen       0.54      0.46      0.50       529
                    Various       0.50      0.42      0.45       435

                   accuracy                           0.54      4625
                  macro avg       0.58      0.33      0.37      4625
               weighted avg       0.54      0.54      0.51      4625


```



In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
pipeline = Pipeline([('s', MinMaxScaler()), ('rf', RandomForestClassifier())])
params =  {
          's__feature_range' : [(0, abs(X).max()), (0, 1), (0, 2*abs(X).max())],
          'rf__n_estimators':  [n for n in range(100, 500, 100)],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

24/02/2021 21:35:22, started grid search
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225) ..


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225), score=0.521, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225) ..


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min remaining:    0.0s


[CV]  rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225), score=0.534, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225) ..


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.9min remaining:    0.0s


[CV]  rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225), score=0.525, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225) ..
[CV]  rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225), score=0.532, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225) ..
[CV]  rf__n_estimators=100, s__feature_range=(0, 1.5235536098480225), score=0.524, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1) ...................
[CV]  rf__n_estimators=100, s__feature_range=(0, 1), score=0.521, total= 1.5min
[CV] rf__n_estimators=100, s__feature_range=(0, 1) ...................
[CV]  rf__n_estimators=100, s__feature_range=(0, 1), score=0.525, total= 1.4min
[CV] rf__n_estimators=100, s__feature_range=(0, 1) ...................
[CV]  rf__n_estimators=100, s__feature_range=(0, 1), score=0.525, total= 1.4min
[CV] rf__n_estimators=100, s__feature_range=(0, 1) ...................
[CV]  rf__n_estimators=100, s__feature_rang

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 217.2min finished


25/02/2021 01:20:03, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.60      0.03      0.06       177
    Bell AKA Bronte Sisters       0.52      0.27      0.35       161
           Charlotte Bronte       1.00      0.23      0.38        13
Edith Rickert & Gleb Botkin       0.50      0.10      0.17        29
               Emily Bronte       0.50      0.15      0.23       426
              Ethel M. Dell       0.59      0.82      0.69      1737
          Fyodor Dostoevsky       0.46      0.48      0.47      1118
                Jane Austen       0.54      0.46      0.50       529
                    Various       0.50      0.42      0.45       435

                   accuracy                           0.54      4625
                  macro avg       0.58      0.33      0.37      4625
               weighted avg       0.54      0.54      0.51      4625

Pipeline(memory=None,
         steps=[('s',
              

##KNN

```

                             precision    recall  f1-score   support

                Anne Bronte       0.26      0.09      0.13       177
    Bell AKA Bronte Sisters       0.42      0.29      0.34       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.21      0.10      0.14        29
               Emily Bronte       0.31      0.24      0.27       426
              Ethel M. Dell       0.57      0.74      0.64      1737
          Fyodor Dostoevsky       0.40      0.37      0.38      1118
                Jane Austen       0.44      0.44      0.44       529
                    Various       0.54      0.34      0.42       435

                   accuracy                           0.49      4625
                  macro avg       0.35      0.29      0.31      4625
               weighted avg       0.47      0.49      0.47      4625
               
```



In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
pipeline = Pipeline([('s', MinMaxScaler(feature_range=(0, 2*abs(X).max()))), ('knn', KNeighborsClassifier())])
params =  {
          'knn__n_neighbors': [39, 17],
          'knn__leaf_size': [30, 12],   
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

25/02/2021 01:30:58, started grid search
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] knn__leaf_size=30, knn__n_neighbors=39 ..........................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  knn__leaf_size=30, knn__n_neighbors=39, score=0.484, total= 3.6min
[CV] knn__leaf_size=30, knn__n_neighbors=39 ..........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.6min remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, score=0.488, total= 3.6min
[CV] knn__leaf_size=30, knn__n_neighbors=39 ..........................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  7.2min remaining:    0.0s


[CV]  knn__leaf_size=30, knn__n_neighbors=39, score=0.479, total= 3.5min
[CV] knn__leaf_size=30, knn__n_neighbors=39 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=39, score=0.494, total= 3.6min
[CV] knn__leaf_size=30, knn__n_neighbors=39 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=39, score=0.474, total= 3.5min
[CV] knn__leaf_size=30, knn__n_neighbors=17 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=17, score=0.485, total= 3.4min
[CV] knn__leaf_size=30, knn__n_neighbors=17 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=17, score=0.482, total= 3.4min
[CV] knn__leaf_size=30, knn__n_neighbors=17 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=17, score=0.484, total= 3.4min
[CV] knn__leaf_size=30, knn__n_neighbors=17 ..........................
[CV]  knn__leaf_size=30, knn__n_neighbors=17, score=0.488, total= 3.4min
[CV] knn__leaf_size=30, knn__n_neighbors=17 ...................

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 82.7min finished


25/02/2021 02:53:51, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.26      0.09      0.13       177
    Bell AKA Bronte Sisters       0.42      0.29      0.34       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.21      0.10      0.14        29
               Emily Bronte       0.31      0.24      0.27       426
              Ethel M. Dell       0.57      0.74      0.64      1737
          Fyodor Dostoevsky       0.40      0.37      0.38      1118
                Jane Austen       0.44      0.44      0.44       529
                    Various       0.54      0.34      0.42       435

                   accuracy                           0.49      4625
                  macro avg       0.35      0.29      0.31      4625
               weighted avg       0.47      0.49      0.47      4625

Pipeline(memory=None,
         steps=[('s',
              

## SVM



```

                             precision    recall  f1-score   support

                Anne Bronte       0.00      0.00      0.00       177
    Bell AKA Bronte Sisters       0.44      0.34      0.39       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.00      0.00      0.00       426
              Ethel M. Dell       0.59      0.79      0.68      1737
          Fyodor Dostoevsky       0.46      0.52      0.49      1118
                Jane Austen       0.52      0.52      0.52       529
                    Various       0.46      0.39      0.42       435

                   accuracy                           0.53      4625
                  macro avg       0.28      0.29      0.28      4625
               weighted avg       0.45      0.53      0.49      4625
```



In [9]:
from sklearn.svm import SVC

In [11]:
pipeline = Pipeline([('s', MinMaxScaler(feature_range=(0, 1))), ('sv', SVC())])
params =  {
          'sv__kernel': ['linear', 'rbf'],
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

25/02/2021 20:00:25, started grid search
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] sv__kernel=linear ...............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................... sv__kernel=linear, score=0.524, total=15.9min
[CV] sv__kernel=linear ...............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 15.9min remaining:    0.0s


[CV] ................... sv__kernel=linear, score=0.539, total=15.9min
[CV] sv__kernel=linear ...............................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 31.8min remaining:    0.0s


[CV] ................... sv__kernel=linear, score=0.524, total=15.9min
[CV] sv__kernel=linear ...............................................
[CV] ................... sv__kernel=linear, score=0.538, total=16.0min
[CV] sv__kernel=linear ...............................................
[CV] ................... sv__kernel=linear, score=0.535, total=15.8min
[CV] sv__kernel=rbf ..................................................
[CV] ...................... sv__kernel=rbf, score=0.506, total=14.3min
[CV] sv__kernel=rbf ..................................................
[CV] ...................... sv__kernel=rbf, score=0.519, total=14.3min
[CV] sv__kernel=rbf ..................................................
[CV] ...................... sv__kernel=rbf, score=0.520, total=14.4min
[CV] sv__kernel=rbf ..................................................
[CV] ...................... sv__kernel=rbf, score=0.534, total=14.3min
[CV] sv__kernel=rbf ..................................................
[CV] .

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 151.2min finished


25/02/2021 22:53:01, grid search complete
                             precision    recall  f1-score   support

                Anne Bronte       0.00      0.00      0.00       177
    Bell AKA Bronte Sisters       0.44      0.34      0.39       161
           Charlotte Bronte       0.00      0.00      0.00        13
Edith Rickert & Gleb Botkin       0.00      0.00      0.00        29
               Emily Bronte       0.00      0.00      0.00       426
              Ethel M. Dell       0.59      0.79      0.68      1737
          Fyodor Dostoevsky       0.46      0.52      0.49      1118
                Jane Austen       0.52      0.52      0.52       529
                    Various       0.46      0.39      0.42       435

                   accuracy                           0.53      4625
                  macro avg       0.28      0.29      0.28      4625
               weighted avg       0.45      0.53      0.49      4625

Pipeline(memory=None,
         steps=[('s', MinMaxScaler(c

  _warn_prf(average, modifier, msg_start, len(result))


##Gradient Boosting

there is no gradient boost.

the session timed out before the first fit.

the session timed out before the first fit completed on the first fold.

the session timed out.

there is no gradient boost. 


In [19]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
pipeline = Pipeline([('s', MinMaxScaler(feature_range=(0,1))), ('gb', GradientBoostingClassifier())])
params =  {
          'gb__max_depth': [10, 500], 
         }

search = GridSearchCV(pipeline, params, cv=5, verbose=3)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, started grid search')
search.fit(X_train, y_train)
print(f'{dt.datetime.now().strftime("%d/%m/%Y %H:%M:%S")}, grid search complete')
    
bow_model = search.best_estimator_

bow_model.fit(X_train, y_train)
y_pred = bow_model.predict(X_test)
    
print(classification_report(y_test, y_pred))
print(bow_model)

26/02/2021 00:32:26, started grid search
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] gb__max_depth=10 ................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


# Conclusion

Oh, Word2Vec. You have failed me in this project. 

The combination of the length of time it takes to build the Word2Vec gensim model, the duration to run a supervised model with the transformations, and the low accuracy scores render Word2Vec a poor solution in regards to modeling for this author multiclassification project.

What else do I have to say about that?

Word2Vec is not an option for this classication project. 

---
*a Thinkful Project by Kalika Kay Curry*
