# The AIgent: Construction of inference models

### The purpose of this notebook is to build classifiers that predict genre tags from BERT embeddings of synopses. Data scraping, operation of the BERT language model, and detailed metric modeling are presented in other notebooks.

Herein, several classifier models are explored. Each uses a set of BERT embeddings as its features; the labels are genre tags. Note that these genre tags are user-generated (by Goodreads members). Because of this, each tag carries a weight that can be used to set thresholds of class (i.e., genre) membership.

Several classification strategies are common for text, including support vector machines, naive Bayes, and logistic regression. Each of these models is tested for accuracy and F1 score across genres. Datasets are balanced prior to training for a particular genre tag, and only synopses with >1000 user tags are used in model building. To maintain the extensibility of the AIgent (i.e., to allow new genres to be incorporated and genre classifiers to be fine-tuned), I have used a one-versus-all approach and built an independent classifier for each genre.
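As a minimal sketch of the one-versus-all setup described above, the snippet below fits one independent logistic regression per genre. The feature matrix, the genre names (`genre_a`, `genre_b`, `genre_c`), and the labels are synthetic stand-ins invented for illustration; they are not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 200 "embeddings" of dimension 8 and
# hypothetical binary genre labels derived from single feature columns.
X = rng.normal(size=(200, 8))
y_by_genre = {
    "genre_a": (X[:, 0] > 0).astype(int),
    "genre_b": (X[:, 1] > 0).astype(int),
}

# One independent binary classifier per genre (one-versus-all).
models = {
    genre: LogisticRegression(max_iter=5000).fit(X, y)
    for genre, y in y_by_genre.items()
}

# Adding a genre later only requires fitting one more model;
# the existing classifiers are left untouched.
y_by_genre["genre_c"] = (X[:, 2] > 0).astype(int)
models["genre_c"] = LogisticRegression(max_iter=5000).fit(X, y_by_genre["genre_c"])

# Score a single embedding against every genre model.
scores = {g: float(m.predict_proba(X[:1])[0, 1]) for g, m in models.items()}
print(scores)
```

Because each genre has its own model, a single classifier can be retrained or replaced without touching the rest.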

Not all feature experimentation is included in this notebook. In other notebooks, I have tested the effect of synopsis length on classifier strength, seeking to set a balance between the speed of the model (which slows with longer synopses) and the robustness of the model (which slowly increases with longer synopses). 

### DATASETS:

- Full df (with synopses, labels): full-ids-genretags-OHE-synopses-df-m9668-n20.csv
    - Note: full df is already restricted by token length (<250), number of reads (>1000)
- Full embedding data (BERT tokenized): full-features-array-m9668-n20.npy


### Metrics:

- The choice of metric is not fixed for this problem and should be extended based on the application. As a first pass (i.e., to validate a model meant to surface similar titles and likely genre tags), I will use the one-versus-all performance of the classifiers. Because some genres share information (e.g., fantasy and sci-fi), and because incorrect genre assignment could have adverse effects on user engagement and trust, the most appropriate metric is precision. In subsequent versions of the AIgent, several updates could be made, including scoring precision across all genres, or precision for the top-1 or top-5 genres.
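To make the metric concrete, here is a small sketch of precision and recall computed from raw counts. The label vectors are made up for illustration, and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels given as 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: 4 true positives exist, the model predicts 3 positives,
# 2 of which are correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision penalizes only wrongly assigned genre tags (false positives), which is the failure mode the paragraph above is most concerned with.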



# Results: 

### The best-performing classifiers tested are simple logistic regression models with default scikit-learn hyperparameters. While other models (e.g., SVM) achieve higher accuracy in some cases, those models do not generalize as well to test sets. This likely reflects the robustness of the embeddings.

Note that this is an experimentation notebook and does not necessarily reflect up-to-date classifier models on the AIgent web application.

In [2]:
!ls production-ready-data

full-features-array-m9668-n20.npy
full-ids-genretags-OHE-synopses-df-m9668-n20.csv
full-ids-genretags-WE-synopses-df-m9668-n20.csv
full-ids-synopses-df-m9668-n20.csv
info.txt
log-reg-models


In [1]:
import numpy as np
import pandas as pd
import transformers
import bert
from bert import tokenization
import torch
import transformers as ppb
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import pickle
from joblib import dump, load

In [2]:
#import required files:

path = "production-ready-data/"
labels_df = pd.read_csv(path+"full-ids-genretags-OHE-synopses-df-m9668-n20.csv")

In [160]:
#load tokenizer models (based on distilbert):

tokenizer = ppb.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = ppb.DistilBertModel.from_pretrained('distilbert-base-uncased')

In [4]:
#method to tokenize and get output from last hidden distilbert layer:

def tokenize_synopsis(input_sentence):
    input_ids = torch.tensor(tokenizer.encode(input_sentence, add_special_tokens=True)).unsqueeze(0)
    outputs = model(input_ids)
    featurized_text = outputs[0][:,0,:].detach().numpy()
    return featurized_text

In [5]:
#to pad:

def padder(tokenized):
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)

    return np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
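The padding step and the attention mask built from it can be illustrated on a toy input. This is a sketch only: the token ids are made up, and `pad_token_ids` is a hypothetical helper that takes a plain list of lists rather than the pandas Series used by `padder`.

```python
import numpy as np

def pad_token_ids(sequences, pad_id=0):
    """Right-pad each list of token ids with pad_id to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in sequences])

# Made-up token ids for two "synopses" of different lengths.
token_ids = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]

padded = pad_token_ids(token_ids)

# 1 where there is a real token, 0 where padding was added, so the model
# can ignore the padded positions.
attention_mask = np.where(padded != 0, 1, 0)

print(padded)
print(attention_mask)
```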

In [60]:
#subset the dataframe by genre:

#goodreads_for_tokenizer = labels_df.copy().sample(1500) #draw enough for m=2500 per class, roughly

pos_sample = labels_df[labels_df['science-fiction'] == 1].sample(2500)
neg_sample = labels_df[labels_df['science-fiction'] == 0].sample(2500)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])


In [64]:
#tokenize and pad:

tokenized = goodreads_for_tokenizer['synopsis'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
padded = padder(tokenized)

#build tensors and run through model:
input_ids = torch.tensor(np.array(padded))
attention_mask = np.where(padded != 0, 1, 0)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
    last_hidden_states = model(input_ids,attention_mask=attention_mask)
features = last_hidden_states[0][:,0,:].numpy()

In [108]:
genres = ['romance', 'contemporary', 'contemporary-romance',
       'adult', 'thriller', 'erotica', 'suspense', 'mystery', 'historical',
       'historical-fiction', 'young-adult', 'fantasy', 'paranormal',
       'new-adult', 'science-fiction', 'adult-fiction', 'chick-lit',
       'womens-fiction']

### Basic Logistic Regression pipeline:

In [121]:
labels = goodreads_for_tokenizer['science-fiction']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression(max_iter=5000)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [122]:
#score model accuracy:

lr_clf.score(test_features, test_labels)

0.848

In [123]:
#f1 score model:

f1_score(test_labels, lr_clf.predict(test_features)) #arguments are (y_true, y_pred)

0.8527131782945736

In [124]:
for genre in genres:
    
    #genre-specific labels
    #(note: goodreads_for_tokenizer was balanced on 'science-fiction' only,
    # so the other genres may be imbalanced in this sample):
    labels = goodreads_for_tokenizer[genre]
    train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
    lr_clf = LogisticRegression(max_iter=5000)
    lr_clf.fit(train_features, train_labels) #fit the model
    #return a score:
    print(genre + " Accuracy Score= " + str(lr_clf.score(test_features, test_labels)))
    print(genre + " F1 Score= " + str(f1_score(test_labels, lr_clf.predict(test_features))))

    

romance Accuracy Score= 0.776
romance F1 Score= 0.7941176470588236
contemporary Accuracy Score= 0.8
contemporary F1 Score= 0.7058823529411765
contemporary-romance Accuracy Score= 0.904
contemporary-romance F1 Score= 0.6842105263157895
adult Accuracy Score= 0.648
adult F1 Score= 0.6451612903225806
thriller Accuracy Score= 0.856
thriller F1 Score= 0.6250000000000001
erotica Accuracy Score= 0.904
erotica F1 Score= 0.6470588235294117
suspense Accuracy Score= 0.792
suspense F1 Score= 0.606060606060606
mystery Accuracy Score= 0.736
mystery F1 Score= 0.4923076923076923
historical Accuracy Score= 0.936
historical F1 Score= 0.6666666666666666
historical-fiction Accuracy Score= 0.92
historical-fiction F1 Score= 0.6428571428571429
young-adult Accuracy Score= 0.744
young-adult F1 Score= 0.6363636363636364
fantasy Accuracy Score= 0.824
fantasy F1 Score= 0.872093023255814
paranormal Accuracy Score= 0.792
paranormal F1 Score= 0.638888888888889
new-adult Accuracy Score= 0.864
new-adult F1 Score= 0.604

In [204]:
path = "balanced_lr_models/"
dump(lr_clf, path+'balanced_lr_clf'+'_adult-fiction'+'.joblib')

['balanced_lr_models/balanced_lr_clf_adult-fiction.joblib']

## Expensive: try longer token lengths

In [219]:
goodreads_we = pd.read_csv("goodreads_warm_encoded_20.csv")
goodreads_we_copy = goodreads_we.copy()

In [146]:
def one_hottify(input_df, data_columns: list):
    #binarize the genre columns: any nonzero tag weight becomes 1
    input_df_columns = list(input_df.columns)
    one_hot = input_df[data_columns].astype(bool).astype(int)
    #carry over the remaining (non-genre) columns unchanged:
    new_cols = [i for i in input_df_columns + data_columns if i not in input_df_columns or i not in data_columns]
    one_hot[new_cols] = input_df[new_cols]
    return one_hot

goodreads_ohe = one_hottify(goodreads_we, genres)

## Rather than one-hot encoding based simply on tag presence/absence, encode using a threshold of .05 (i.e., if ≤5% of tags indicate a genre, code as 0)

In [220]:
input_df_columns = list(goodreads_we.columns)
#threshold-encoded:
goodreads_we_copy[genres] = goodreads_we_copy[genres].apply(lambda x: [0 if y <= .05 else 1 for y in x])
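The same thresholding can be written as a vectorized comparison. The sketch below runs on a small made-up frame; the column values are illustrative only. One assumption to flag: a missing value (NaN) would be coded as 0 here, whereas the list-comprehension version above would code it as 1.

```python
import pandas as pd

# Made-up tag weights (fraction of user tags per genre) for three books.
weights = pd.DataFrame({
    "romance": [0.289474, 0.0, 0.05],
    "thriller": [0.0526316, 0.04, 0.30],
})

THRESHOLD = 0.05

# Weights at or below the threshold become 0, everything above becomes 1.
encoded = (weights > THRESHOLD).astype(int)
print(encoded)
```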


In [221]:
goodreads_we.iloc[0]

Unnamed: 0                                                              0
romance                                                          0.289474
contemporary                                                     0.114035
contemporary-romance                                            0.0438596
adult                                                            0.105263
thriller                                                        0.0526316
erotica                                                          0.210526
suspense                                                         0.184211
mystery                                                                 0
historical                                                              0
historical-fiction                                                      0
young-adult                                                             0
fantasy                                                                 0
paranormal                            

In [222]:
goodreads_we_copy.iloc[0] ##encoded at 5% threshold

Unnamed: 0                                                              0
romance                                                                 1
contemporary                                                            1
contemporary-romance                                                    0
adult                                                                   1
thriller                                                                1
erotica                                                                 1
suspense                                                                1
mystery                                                                 0
historical                                                              0
historical-fiction                                                      0
young-adult                                                             0
fantasy                                                                 0
paranormal                            

In [223]:
goodreads_we_copy['token_length'] = [len(tokenizer.encode(sentence)) 
                             for sentence in goodreads_we_copy['synopsis']]

Token indices sequence length is longer than the specified maximum sequence length for this model (872 > 512). Running this sequence through the model will result in indexing errors

In [147]:
goodreads_ohe['token_length'] = [len(tokenizer.encode(sentence)) 
                             for sentence in goodreads_ohe['synopsis']]

Token indices sequence length is longer than the specified maximum sequence length for this model (872 > 512). Running this sequence through the model will result in indexing errors

In [148]:
MAX_LENGTH = 300
goodreads_ohe_trimmed = goodreads_ohe[goodreads_ohe['token_length'] < MAX_LENGTH]


In [151]:
len(goodreads_ohe_trimmed)

12803

In [224]:
df_500 = goodreads_ohe[goodreads_ohe['token_length'] < 500]
df_400 = goodreads_ohe[goodreads_ohe['token_length'] < 400]
df_300 = goodreads_ohe[goodreads_ohe['token_length'] < 300]
df_200 = goodreads_ohe[goodreads_ohe['token_length'] < 200]

goodreads_we_thresh_500 = goodreads_we_copy[goodreads_we_copy['token_length'] < 500]
goodreads_we_thresh_400 = goodreads_we_copy[goodreads_we_copy['token_length'] < 400]
goodreads_we_thresh_300 = goodreads_we_copy[goodreads_we_copy['token_length'] < 300]
goodreads_we_thresh_200 = goodreads_we_copy[goodreads_we_copy['token_length'] < 200]



## Balance datasets:

In [187]:
pos_sample = df_200[df_200['romance'] == 1].sample(100)
neg_sample = df_200[df_200['romance'] == 0].sample(100)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])

In [190]:
pos_sample = df_300[df_300['romance'] == 1].sample(100)
neg_sample = df_300[df_300['romance'] == 0].sample(100)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])

In [193]:
pos_sample = df_400[df_400['romance'] == 1].sample(100)
neg_sample = df_400[df_400['romance'] == 0].sample(100)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])

In [196]:
pos_sample = df_500[df_500['romance'] == 1].sample(100)
neg_sample = df_500[df_500['romance'] == 0].sample(100)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])

In [348]:
pos_sample = goodreads_we_thresh_500[goodreads_we_thresh_500['adult'] == 1].sample(100)
neg_sample = goodreads_we_thresh_500[goodreads_we_thresh_500['adult'] == 0].sample(100)
goodreads_for_tokenizer = pd.concat([pos_sample, neg_sample])
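The balancing cells above all follow the same pattern, which could be captured in one helper. This is a sketch under assumptions: `balanced_sample` is a hypothetical function name, the toy frame is made up, and capping the sample at the size of the smaller class is a design choice that the original cells do not make.

```python
import pandas as pd

def balanced_sample(df, genre, n_per_class, random_state=None):
    """Draw an equal number of positive and negative rows for one genre."""
    pos = df[df[genre] == 1]
    neg = df[df[genre] == 0]
    # Never ask for more rows than the smaller class can supply.
    n = min(n_per_class, len(pos), len(neg))
    return pd.concat([
        pos.sample(n, random_state=random_state),
        neg.sample(n, random_state=random_state),
    ])

# Made-up frame with a binary genre column and a token length per synopsis.
toy = pd.DataFrame({
    "romance": [1, 1, 1, 0, 0, 0, 0, 0],
    "token_length": [120, 180, 450, 90, 210, 330, 470, 150],
})

# Restrict by token length first, then balance on the genre of interest.
subset = balanced_sample(toy[toy["token_length"] < 400], "romance",
                         n_per_class=2, random_state=0)
print(subset)
```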

In [349]:
#tokenize and pad:

tokenized = goodreads_for_tokenizer['synopsis'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
padded = padder(tokenized)

#build tensors and run through model:
input_ids = torch.tensor(np.array(padded))
attention_mask = np.where(padded != 0, 1, 0)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
    last_hidden_states = model(input_ids,attention_mask=attention_mask)
features = last_hidden_states[0][:,0,:].numpy()

In [246]:
goodreads_we.columns

Index(['Unnamed: 0', 'romance', 'contemporary', 'contemporary-romance',
       'adult', 'thriller', 'erotica', 'suspense', 'mystery', 'historical',
       'historical-fiction', 'young-adult', 'fantasy', 'paranormal',
       'new-adult', 'science-fiction', 'adult-fiction', 'chick-lit',
       'womens-fiction', 'synopsis', 'id'],
      dtype='object')

In [337]:
labels = goodreads_for_tokenizer['womens-fiction']
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression(max_iter=2000)
lr_clf.fit(train_features, train_labels) #
print(" Accuracy Score= " + str(lr_clf.score(test_features, test_labels)))
print(" F1 Score= " + str(f1_score(test_labels, lr_clf.predict(test_features))))

 Accuracy Score= 0.88
 F1 Score= 0.8749999999999999


In [395]:
from sklearn.metrics import average_precision_score
#average_precision = average_precision_score(y_test, y_score)
y_score = lr_clf.decision_function(test_features)
average_precision = average_precision_score(test_labels, y_score)

In [396]:
average_precision

0.9300797346022579
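For intuition about what the average precision score summarizes, the sketch below recomputes it by hand on a tiny made-up example and compares it with scikit-learn. The labels and scores are invented, and the manual formula assumes there are no tied scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Made-up labels and classifier scores (no ties among the scores).
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.2])

# Rank examples from highest to lowest score.
order = np.argsort(-y_score)
hits = y_true[order]

# Precision after each ranked example, then averaged over the positions
# where a true positive was retrieved.
precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
manual_ap = precision_at_k[hits == 1].mean()

sklearn_ap = average_precision_score(y_true, y_score)
print(manual_ap, sklearn_ap)
```

A value near 1 means the positives are ranked ahead of the negatives across most decision thresholds, not just at the default one.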

In [339]:
path = "balanced_lr_models/token-500-models/"
dump(lr_clf, path+'balanced_lr_clf'+'_womens-fiction-token-500-thresholded'+'.joblib')

['balanced_lr_models/token-500-models/balanced_lr_clf_womens-fiction-token-500-thresholded.joblib']

In [358]:
!ls balanced_lr_models/token-500-models/

balanced_lr_clf_adult-fiction-token-500-thresholded.joblib
balanced_lr_clf_chick-lit-token-500-thresholded.joblib
balanced_lr_clf_contemporary-romance-token-500-thresholded.joblib
balanced_lr_clf_contemporary-token-500-thresholded.joblib
balanced_lr_clf_erotica-token-500-thresholded.joblib
balanced_lr_clf_fantasy-token-500-thresholded.joblib
balanced_lr_clf_historical-fiction-token-500-thresholded.joblib
balanced_lr_clf_historical-token-500-thresholded.joblib
balanced_lr_clf_mystery-token-500-thresholded.joblib
balanced_lr_clf_new-adult-token-500-thresholded.joblib
balanced_lr_clf_paranormal-token-500-thresholded.joblib
balanced_lr_clf_romance-token-500-thresholded.joblib
balanced_lr_clf_science-fiction-token-500-thresholded.joblib
balanced_lr_clf_suspense-token-500-thresholded.joblib
balanced_lr_clf_thriller-token-500-thresholded.joblib
balanced_lr_clf_womens-fiction-token-500-thresholded.joblib
balanced_lr_clf_young-adult-token-500-thresholded.joblib


In [365]:
from typing import Dict

lr_path = 'balanced_lr_models/token-500-models/'


In [366]:
lr_models: Dict = {
            genre: load(f'{lr_path}balanced_lr_clf_{genre}-token-500-thresholded.joblib')
            for genre in genre_list
        }

In [360]:
genre_list = [
    'romance', 'mystery', 'science-fiction', 'erotica', 'historical-fiction',
    'paranormal', 'fantasy', 'adult-fiction', 'chick-lit',
    'contemporary-romance', 'contemporary', 'new-adult', 'suspense',
    'thriller', 'womens-fiction', 'young-adult'
]

In [373]:
accuracies = {"romance":.86, "science fiction":.88, "thriller":.82, "erotica":.79, 
              "suspense":.81,  
             "mystery":.83, "paranormal":.74 ,"chick-lit":.78}

In [383]:
from bokeh.io import show, output_file
from bokeh.plotting import figure


In [436]:
genre_names = ["romance", "science fiction", "thriller", "erotica", "suspense", "mystery",
         "paranormal", "chick-lit"]
#counts = [.86, .88, .82, .79, .81, .83, .74, .78] #accuracies
counts = [.85,.89,.82,.8,.81,.81,.75,.76]


In [446]:
p = figure(x_range=genre_names, title="Prediction accuracy by genre", plot_width=800, plot_height=300)
p.vbar(x=genre_names, top=counts, width=0.9)
p.axis.major_label_text_font_size = "12pt"
p.title.text_font_size = '14pt'
show(p)

In [419]:
features = np.load("production-ready-data/full-features-array-m9668-n20.npy")
book_data = pd.read_csv("production-ready-data/full-ids-synopses-df-m9668-n20.csv")
full_df = pd.read_csv("/Users/ryan/Projects/Insight/insight/Agent-2/agent-2/streamlit-agent/goodreads_top40k_with_ranks.csv")

In [421]:
full_df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,source_book_id,source_book_series_id,source_book_series_name,source_title_desc,source_book_url,published_year,author_names,publisher_names,...,user_rating_cnt,description,source_person_id,source_person_desc,source_person_url,person_user_rating_avg,person_user_rating_cnt,person_user_review_cnt,summary_language,user_ratings_rank
0,0,49,52190546,,,Swallowing Mercury,https://www.goodreads.com/work/best_book/52190546,2014,"[Wioletta Greg, Wioletta Grzegorzewska, Eliza ...",[Portobello Books],...,1199,Wiola lives in a close-knit agricultural commu...,16121942,Eliza Marciniak,https://www.goodreads.com/author/show/16121942...,7.50,914,205,English,0.112812
1,1,107,52187979,,,Take Me with You,https://www.goodreads.com/work/best_book/52187979,2016,[Nina G. Jones],[null],...,1696,I watch. I study. I prowl. I hunt. I always go...,7149925,Nina G. Jones,https://www.goodreads.com/author/show/7149925....,8.22,17575,2372,English,0.415711
2,2,202,52184180,194405.0,The Inspector de Silva Mysteries #1,Trouble in Nuala,https://www.goodreads.com/work/best_book/52184180,2016,[Harriet Steel],[Stane Street Press],...,1053,When Inspector Shanti de Silva moves with his ...,5423444,Harriet Steel,https://www.goodreads.com/author/show/5423444....,7.78,3299,374,English,0.205336
3,3,472,21404389,134050.0,Faking Normal #1,Faking Normal,https://www.goodreads.com/work/best_book/21404389,2014,[Courtney C. Stevens],[HarperTeen],...,7630,Alexi Littrell hasn't told anyone what happene...,6431805,Courtney C. Stevens,https://www.goodreads.com/author/show/6431805....,8.04,14079,2381,English,0.382592
4,4,796,21404513,107870.0,Birthright #1,Darkest Fear,https://www.goodreads.com/work/best_book/21404513,2014,[Cate Tiernan],[Simon Pulse],...,1050,Vivi’s animal instincts are her legacy—and may...,191456,Cate Tiernan,https://www.goodreads.com/author/show/191456.C...,8.30,219416,8811,English,0.861968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41768,41768,1031864,68454173,,,The Lady and the Highwayman,https://www.goodreads.com/work/best_book/68454173,2019,[Sarah M. Eden],[Shadow Mountain],...,1307,Elizabeth Black is the headmistress of a girls...,2515462,Sarah M. Eden,https://www.goodreads.com/author/show/2515462....,8.12,92120,15897,English,0.723003
41769,41769,1031896,60637050,,,Turbulence,https://www.goodreads.com/work/best_book/60637050,2018,[David Szalay],[Scribner],...,2500,"From the acclaimed, Man Booker Prize-shortlist...",1360792,David Szalay,https://www.goodreads.com/author/show/1360792....,7.32,9356,1389,English,0.323163
41770,41770,1031899,57863355,240053.0,Dark Lives #1,Fifty Years of Fear,https://www.goodreads.com/work/best_book/57863355,2017,[Ross Greenwood],[null],...,1190,"""At last a page turner novel that doesn't disa...",7999731,Ross Greenwood,https://www.goodreads.com/author/show/7999731....,8.20,2632,474,English,0.184401
41771,41771,1031900,57868416,193026.0,Legacy #3,Renegade Magic,https://www.goodreads.com/work/best_book/57868416,2017,[McKenzie Hunter],[Sky Publishing LLC],...,1664,Secrets have consequences—and so do truths. I’...,8295601,McKenzie Hunter,https://www.goodreads.com/author/show/8295601....,8.16,20716,1447,English,0.443145
