# Complement Naive Bayes and Prediction
With the TF-IDF data ready, we can move on to building the model and making predictions. In the final version, this step will be part of the pipeline and will process the training and test sets separately.
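
As a rough illustration of what that pipeline step could look like, the sketch below chains a `TfidfVectorizer` and a `ComplementNB` classifier in a scikit-learn `Pipeline`. The toy texts, genres, and step names are hypothetical stand-ins rather than the project's actual data or final design. The vectorizer is fitted on the training texts only, and the test texts are transformed with the vocabulary learned from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned summaries and genres.
train_texts = [
    "dragon wizard quest magic kingdom",
    "wizard magic sword dragon",
    "detective murder clue suspect",
    "murder suspect police detective",
]
train_genres = ["Fantasy", "Fantasy", "Mystery", "Mystery"]
test_texts = ["dragon magic quest", "police clue murder"]

# Step names "tfidf" and "clf" are arbitrary labels chosen for this sketch.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ComplementNB()),
])

# fit() learns the vocabulary, IDF weights, and classifier from training data only.
pipeline.fit(train_texts, train_genres)

# predict() reuses the fitted vectorizer to transform the unseen test texts.
test_preds = pipeline.predict(test_texts)
print(test_preds)
```

Keeping the vectorizer inside the pipeline means the test set never influences the learned vocabulary or IDF weights.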

In [29]:
random_seed = 42

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import clear_output
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics

import pickle

In [3]:
# get pickled TF-IDF data and vectorizer
path_to_file = "Data/pickles/TFIDF.dat"
with open(path_to_file, "rb") as f:
    tfidf_data, tfidf_vectorizer = pickle.load(f)

In [7]:
# get dataframe with summaries and genres
path_to_data = "Data/cleaned_summaries_and_genres.csv"
df = pd.read_csv(path_to_data, index_col=0)
df

Unnamed: 0,summary,genre
0,old major old boar manor farm call animal farm...,Children's literature
1,old major old boar manor farm call animal farm...,Speculative fiction
2,old major old boar manor farm call animal farm...,Fiction
3,alex teenager live nearfuture england lead gan...,Science Fiction
4,alex teenager live nearfuture england lead gan...,Speculative fiction
...,...,...
26536,series follow character nick stone exmilitary ...,Fiction
26537,series follow character nick stone exmilitary ...,Suspense
26538,reader first meet rapp covert operation iran d...,Thriller
26539,reader first meet rapp covert operation iran d...,Fiction


In [14]:
# encode genres into labels and make X and y
le = LabelEncoder()

X = tfidf_data

y_names = df['genre']
y = le.fit_transform(y_names)
y

array([ 2, 15,  6, ..., 18,  6, 15])
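
The integer codes above are indices into `le.classes_`, which holds the sorted unique genre names. A minimal sketch of the round trip, using a short hypothetical genre list instead of `df['genre']`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical genre column standing in for df['genre'].
genres = ["Fiction", "Fantasy", "Fiction", "Mystery"]

le = LabelEncoder()
codes = le.fit_transform(genres)

# classes_ holds the sorted unique labels; each code is an index into it.
print(le.classes_)
print(codes)

# inverse_transform maps integer codes (or predictions) back to genre names.
decoded = le.inverse_transform(codes)
print(decoded)
```

The same `inverse_transform` call can be applied to model predictions to report genre names instead of integers.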

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)
print("Array Shapes\nX_train: {}  y_train: {}\nX_test:  {}   y_test: {}"
      .format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))

Array Shapes
X_train: (18578, 116360)  y_train: (18578,)
X_test:  (7963, 116360)   y_test: (7963,)
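
Because the dataframe repeats each summary once per genre, a row-level random split can place the same summary in both the training and test sets. One possible way to avoid that is a group-aware split. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical toy data and is not part of the original notebook. The `summaries` array, the placeholder `X`, and the label values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: each summary appears once per genre it belongs to.
summaries = np.array(["s0", "s0", "s0", "s1", "s1", "s2", "s3", "s3", "s4", "s5"])
y = np.array([2, 15, 6, 14, 15, 6, 18, 6, 6, 15])
X = np.arange(len(summaries)).reshape(-1, 1)  # placeholder feature matrix

# Rows that share a summary text receive the same integer group id.
_, groups = np.unique(summaries, return_inverse=True)

# test_size here is the approximate fraction of groups, not rows, held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# With a group-aware split, no summary should sit on both sides.
shared = set(summaries[train_idx]) & set(summaries[test_idx])
print("Summaries in both sets:", len(shared))
```

Index arrays like `train_idx` can also be used to select rows from a SciPy CSR matrix such as the TF-IDF data.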


## Complement Naive Bayes
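Complement Naive Bayes (CNB) is a variant of multinomial Naive Bayes that estimates each class's weights from the documents outside that class. It was designed to cope with imbalanced classes and often outperforms standard multinomial Naive Bayes on text classification tasks.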
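
To make the "complement" idea concrete, the sketch below recomputes the class weights with NumPy and compares the resulting predictions with scikit-learn's `ComplementNB`. It assumes the default settings (`alpha=1.0`, `norm=False`) and uses random hypothetical data in place of the TF-IDF matrix, so it illustrates the calculation rather than the notebook's results.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)
X = rng.random((60, 8))          # hypothetical non-negative features, like TF-IDF
y = rng.integers(0, 3, size=60)  # three hypothetical classes

alpha = 1.0  # smoothing value, matching ComplementNB's default

classes = np.unique(y)
# Row c holds the summed feature values of the documents in class c.
class_counts = np.vstack([X[y == c].sum(axis=0) for c in classes])
# Complement counts: totals over all documents NOT in class c, plus smoothing.
comp_counts = class_counts.sum(axis=0) + alpha - class_counts
# Weights are the negative log of the normalised complement counts.
weights = -np.log(comp_counts / comp_counts.sum(axis=1, keepdims=True))

# Each document goes to the class with the largest weighted sum, that is,
# the class whose complement it resembles least.
manual_preds = classes[np.argmax(X @ weights.T, axis=1)]

clf = ComplementNB(alpha=alpha).fit(X, y)
agreement = np.mean(manual_preds == clf.predict(X))
print("Agreement with scikit-learn:", agreement)
```

Because every class's weights are estimated from all the other classes' documents, rare genres still draw on a large pool of data, which is the motivation for using CNB on imbalanced label sets.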

In [23]:
clf = ComplementNB()
clf.fit(X_train, y_train)

ComplementNB()

In [25]:
preds = clf.predict(X_test)

In [31]:
print(metrics.classification_report(y_test, preds, target_names=le.classes_))

                        precision    recall  f1-score   support

       Adventure novel       0.07      0.02      0.03        94
     Alternate history       0.06      0.01      0.02        78
 Children's literature       0.14      0.07      0.10       666
         Crime Fiction       0.12      0.05      0.07       236
     Detective fiction       0.07      0.03      0.04       103
               Fantasy       0.17      0.15      0.16       690
               Fiction       0.14      0.32      0.19      1403
    Historical fiction       0.00      0.00      0.00       113
      Historical novel       0.18      0.08      0.11       204
                Horror       0.02      0.01      0.01       142
               Mystery       0.09      0.06      0.07       436
           Non-fiction       0.22      0.03      0.05        73
                 Novel       0.15      0.07      0.09       751
         Romance novel       0.04      0.01      0.01       135
       Science Fiction       0.24      

In [34]:
print(metrics.confusion_matrix(y_test, preds))


[[  2   1  11   0   0   3  34   0   4   0   1   0   5   0   8  23   0   0
    0   2]
 [  0   1   0   0   0   3  16   0   3   0   1   0   1   0  22  31   0   0
    0   0]
 [  3   0  49   1   4  80 328   0   6   6  13   1   9   1  27 121   2   0
    1  14]
 [  0   0   0  12   1   0 172   0   2   0  30   0   5   1   1   5   0   3
    4   0]
 [  0   0   4   3   3   0  41   0   0   0  46   0   2   0   1   1   0   1
    0   1]
 [  1   1  49   0   0 102 132   0   0   3   6   0   2   1  43 337   0   1
    4   8]
 [  6   2 113  23   2  86 453   9  23   8  69   2 153  10 113 292   6  18
   10   5]
 [  0   0   3   1   0   1  82   0  11   0   0   0   7   1   0   5   0   0
    0   2]
 [  6   0   9   3   0   4 121   5  16   1   6   2  10   1   0  15   1   0
    0   4]
 [  0   0   1   0   0  11  40   1   0   1   2   0   3   0   5  74   0   1
    0   3]
 [  0   0  12  37  25   6 256   1   2   2  24   0   9   4   7  33   0  16
    1   1]
 [  0   0   1   0   0   0  58   0   0   0   0   2   4   0   7   1