# __Step 5: Make predictions and interpret model__

The model chosen for interpretation is Word2Vec-based:
- Because performance is very similar across models, I focus, for interpretability, on the Word2Vec-based model with `[min_count, window, n_gram] = [20, 8, 3]`.
- This setting gives a smaller feature set (because `min_count` is high) that includes tri-grams (three-word combinations, which help with interpretation).
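
To make the effect of `min_count` concrete, here is a minimal sketch in plain Python. It is not the gensim-based pipeline used in this notebook: `count_ngrams`, `filter_by_min_count`, and the toy documents are hypothetical and only illustrate how a frequency threshold shrinks a vocabulary that includes uni-, bi-, and tri-grams.

```python
from collections import Counter

def count_ngrams(docs, max_n=3):
    """Count all 1- to max_n-grams across whitespace-tokenized documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def filter_by_min_count(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

# Toy corpus (made up for illustration)
toy_docs = [
    "plant cell wall synthesis",
    "plant cell wall degradation",
    "animal cell membrane",
]

counts = count_ngrams(toy_docs, max_n=3)
kept_low = filter_by_min_count(counts, 1)   # every n-gram survives
kept_high = filter_by_min_count(counts, 2)  # only repeated n-grams survive

print(len(kept_low), len(kept_high))
print(sorted(kept_high))
```

With the higher threshold only the repeated n-grams remain, including the tri-gram `plant cell wall`, which is easier to read than its separate words.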

Also for interpretability:
- The Tf-Idf model with `[max_features, ngram_range, p_threshold] = [10000.0, (1, 3), 0.01]` is also examined.
- This is because Tf-Idf values are immediately interpretable, and because the model is based on XGBoost, I can get feature importance using SHAP.
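
As a rough illustration of why Tf-Idf values are easy to read, the sketch below computes the textbook form `tf * log(N / df)` on a toy corpus. This is an assumption-laden simplification: scikit-learn's `TfidfVectorizer` applies smoothing and normalization, so its numbers differ, and the helper `tf_idf` and the documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy corpus (made up for illustration)
toy_docs = ["plant root growth", "plant leaf growth", "mouse brain growth"]
scores = tf_idf(toy_docs)

for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

A term found in every document (`growth`) scores zero, while a term unique to one document (`root`) scores highest, so each feature value reads directly as "how distinctive is this term for this record".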

Goals: I am interested in:
- Model interpretation
  - Figure out how the top words that define plant sciences are related to each other.
  - See whether records from certain date ranges or journals are more challenging to predict.
- Making predictions for the entire corpus
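
One way to check whether some journals (or date ranges) are harder to predict is to compute an error rate per group. The sketch below is a minimal version under assumptions: `error_rate_by_group` is a hypothetical helper, and the group names, labels, and predictions are toy values, not results from this project.

```python
from collections import defaultdict

def error_rate_by_group(groups, y_true, y_pred):
    """Return {group: fraction of records where prediction != true label}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        totals[group] += 1
        errors[group] += int(truth != pred)
    return {group: errors[group] / totals[group] for group in totals}

# Toy values (made up for illustration)
journals = ["journal_A", "journal_A", "journal_B", "journal_B"]
y_true = [1, 1, 0, 1]
y_pred = [1, 1, 0, 0]

rates = error_rate_by_group(journals, y_true, y_pred)
print(rates)
```

The same function works for date ranges if each record is first assigned a bin label, such as its publication year.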


## ___Set up___

### Module import

In [1]:
import os
import json
import pandas as pd
import numpy as np
import pickle
import sys
import itertools
from pathlib import Path

from sklearn import model_selection, metrics

## for word embedding with w2v
import gensim

from tensorflow.keras import models, layers, callbacks, preprocessing

import script_2_3_text_classify_w2v as script23

### Key variables

In [30]:
# Reproducibility
seed = 20220609

# Setting paths
work_dir   = Path.home() / "projects/plant_sci_hist/2_text_classify"

os.chdir(work_dir)

# Training data for interpretation purpose
corpus_train = work_dir / "corpus_train.json"

# The columns to focus on
target_col = 'txt'

# Trained Word2Vec model, tokenizer, and vocab for getting embeddings
w2v_name   = work_dir / "model_cln_w2v_20-8-3"
tok_name   = work_dir / "model_cln_w2v_token_20-8-3"
vocab_name = work_dir / "model_cln_w2v_vocab_20-8-3"

# Getting ngrams
ngram = 3
min_count = 20

# DNN checkpoint path
cp_filepath = work_dir / "model_cln_w2v_20-8-3_dnn"

# Corpus to make predictions for
corpus_dir  = Path.home() / "projects/plant_sci_hist/1_obtaining_corpus"
corpus_file = corpus_dir / "pubmed_qualified.tsv"
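
Because the later cells depend on several files being in place, a quick existence check can save a long run from failing midway. The sketch below is an assumption, not part of the original workflow: `missing_paths` is a hypothetical helper, and it matches by prefix because TensorFlow-format checkpoints are written as `<prefix>.index` and `<prefix>.data-*` files rather than as a single file.

```python
import tempfile
from pathlib import Path

def missing_paths(paths):
    """Return the paths for which no file name starts with the given name."""
    missing = []
    for path in map(Path, paths):
        # Prefix match so that checkpoint prefixes count as present
        if not any(path.parent.glob(path.name + "*")):
            missing.append(path)
    return missing

# Self-contained demo with temporary files (names are made up)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "demo_model.index").touch()
    result = missing_paths([tmp / "demo_model", tmp / "demo_vocab"])
    missing_names = [p.name for p in result]

print(missing_names)
```

In this notebook the call would look like `missing_paths([corpus_train, w2v_name, tok_name, vocab_name, cp_filepath, corpus_file])`, and an empty list means every input was found.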

## ___Analysis of prediction outcome___

### Load w2v model, tokenizer, and vocab

Need:
- W2V model
- Tokenizer and vocab
- Trained DNN model

In [25]:
# Load word2vec model
with open(w2v_name, "rb") as f:
  model_w2v = pickle.load(f)
model_w2v

<gensim.models.word2vec.Word2Vec at 0x7f74831da350>

In [26]:
# Load tokenizer and vocab
with open(tok_name, "rb") as f:
  tokenizer = pickle.load(f)

with open(vocab_name, "rb") as f:
  vocab = pickle.load(f)

### Get training split

In [34]:
# Read-only mode is enough for loading the training corpus
with corpus_train.open("r") as f:
  corpus_combo_json = json.load(f)

UnpicklingError: invalid load key, '"'.
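
The `UnpicklingError` above is the message `pickle.load` gives when the first byte of a file is a double quote, which is how a JSON-encoded string begins. That suggests, though the notebook does not confirm it, that the traceback comes from a version of the cell that tried to unpickle the JSON file. A defensive loader could try JSON first and fall back to pickle. The sketch below uses a hypothetical helper name, `load_json_or_pickle`, and temporary demo files.

```python
import json
import pickle
import tempfile
from pathlib import Path

def load_json_or_pickle(path):
    """Load a file as JSON if possible, otherwise as a pickle.

    Only use the pickle fallback on files you trust, since unpickling
    can execute arbitrary code.
    """
    path = Path(path)
    try:
        with path.open("r", encoding="utf-8") as f:
            return json.load(f)
    except (UnicodeDecodeError, json.JSONDecodeError):
        with path.open("rb") as f:
            return pickle.load(f)

# Self-contained demo with temporary files (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "demo.json"
    pkl_path = Path(tmp) / "demo.pkl"
    json_path.write_text(json.dumps({"txt": ["a", "b"]}), encoding="utf-8")
    pkl_path.write_bytes(pickle.dumps({"label": [0, 1]}))

    from_json = load_json_or_pickle(json_path)
    from_pickle = load_json_or_pickle(pkl_path)

print(from_json, from_pickle)
```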

## ___Make predictions on the whole dataset___

### Read corpus that needs to be predicted

In [None]:
corpus_df_raw = pd.read_csv(corpus_file, delimiter='\t')
corpus_df_raw.shape

(1497511, 6)

In [None]:
# Drop duplicated rows
corpus_df = corpus_df_raw.drop_duplicates()

# Drop all records with NAs
corpus_df = corpus_df.dropna(axis=0)

# Create a new column 'txt' by concatenating 'Title' and 'Abstract'
corpus_df['txt'] = corpus_df['Title'] + " " + corpus_df['Abstract']
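
The order of these steps matters: concatenating strings in pandas propagates missing values, so a record with a missing abstract would end up with a missing `txt`. The toy frame below (made-up values, not project data) sketches this and shows the same clean-up sequence on three rows.

```python
import pandas as pd

# Toy records: row 1 lacks an abstract, row 2 duplicates row 0
toy = pd.DataFrame({
    "Title": ["Title one", "Title two", "Title one"],
    "Abstract": ["Abstract one", None, "Abstract one"],
})

# Concatenating before cleaning leaves a missing value in the result
naive = toy["Title"] + " " + toy["Abstract"]
print(naive.isna().sum())

# Cleaning first: drop duplicates, drop rows with NAs, then concatenate
clean = toy.drop_duplicates().dropna(axis=0).copy()
clean["txt"] = clean["Title"] + " " + clean["Abstract"]
print(clean["txt"].tolist())
```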

In [None]:
corpus_df.head(3)

| | PMID | Date | Journal | Title | Abstract | QualifiedName | txt |
|---|---|---|---|---|---|---|---|
| 0 | 36 | 1975-11-01 | The British journal of nutrition | The effects of processing of barley-based supp... | 1. In one experiment the effect on rumen pH of... | barley | The effects of processing of barley-based supp... |
| 1 | 52 | 1975-12-02 | Biochemistry | Evidence of the involvement of a 50S ribosomal... | The functional role of the Bacillus stearother... | rose | Evidence of the involvement of a 50S ribosomal... |
| 2 | 60 | 1975-12-11 | Biochimica et biophysica acta | The reaction between the superoxide anion radi... | 1. The superoxide anion radical (O2-) reacts w... | tuna | The reaction between the superoxide anion radi... |


In [None]:
corpus_df['txt'][0]

'The effects of processing of barley-based supplements on rumen pH, rate of digestion of voluntary intake of dried grass in sheep. 1. In one experiment the effect on rumen pH of feeding with restricted amounts of whole or pelleted barley was studied. With whole barley there was little variation in rumen pH associated with feeding time, but with pelleted barley the pH decreased from about 7-0 before feeding to about 5-3, 2--3 h after feeding. 2. The rate of disappearance of dried grass during incubation in the rumens of sheep receiving either whole or pelleted barley was studied in a second experiment. After 24 h incubation only 423 mg/g incubated had disappeared in the rumen of sheep receiving pelleted barley while 625 mg/g incubated had disappeared when it was incubated in the rumen of sheep receiving whole barley. 3. The voluntary intake of dried grass of lambs was studied in a third experiment when they received supplements of either 25 or 50 g whole or pelleted barley/kg live weigh

### Get word embeddings, w2v feature matrix using corpus

In [8]:
# Get ngrams
X        = corpus_df[target_col]
X_ngrams = script23.get_ngram(X, ngram, min_count)

In [9]:
# Get embeddings
embeddings, X_w2v = script23.get_embeddings(X, model_w2v, tokenizer, vocab)
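
The internals of `script23.get_embeddings` are not shown here. As a rough sketch of what such a step commonly involves, the code below builds an embedding matrix with one row per vocabulary index and pads integer token sequences to a fixed length. Everything in it is an assumption for illustration: the helper names, the toy vocabulary and vectors, and the choice to pad at the end (Keras' `pad_sequences` pads at the start by default).

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim):
    """vocab: {word: index >= 1}; vectors: {word: 1-D array of length dim}.

    Row 0 is reserved for padding; words without a vector stay all-zero.
    """
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

def pad_post(seqs, maxlen):
    """Truncate or zero-pad each integer sequence at the end to maxlen."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, seq in enumerate(seqs):
        trimmed = seq[:maxlen]
        out[i, :len(trimmed)] = trimmed
    return out

# Toy vocabulary and 2-dimensional vectors (made up for illustration)
vocab = {"plant": 1, "cell": 2, "wall": 3}
vectors = {"plant": np.array([1.0, 0.0]), "cell": np.array([0.0, 1.0])}

matrix = build_embedding_matrix(vocab, vectors, dim=2)
padded = pad_post([[1, 2, 3], [2]], maxlen=4)

# Looking up rows turns each document into a (maxlen, dim) array
embedded = matrix[padded]
print(matrix.shape, padded.tolist(), embedded.shape)
```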

### Load model

In [10]:
model = script23.get_w2v_emb_model(embeddings)
model.load_weights(cp_filepath)

2022-06-21 13:40:00.933422: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:09:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-06-21 13:40:00.988692: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f747a7924d0>


### Make predictions 

In [12]:
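# Predicted class probabilities for each record
y_pred_prob   = model.predict(X_w2v)
# Map the argmax column index to its class label (0 or 1)
dic_y_mapping = {n:label for n,label in enumerate(np.unique([0,1]))}
y_pred        = [dic_y_mapping[np.argmax(pred)] for pred in y_pred_prob]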


KeyboardInterrupt: 
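
The interrupted run above suggests that a single `predict` call over the whole corpus takes a long time. One option, sketched below under assumptions, is to predict in chunks so that partial results can be saved along the way. `predict_in_chunks` is a hypothetical helper, and `toy_predict` is a stand-in for `model.predict` that returns two-column class probabilities.

```python
import numpy as np

def predict_in_chunks(predict_fn, X, chunk_size):
    """Apply predict_fn to consecutive slices of X and return argmax labels."""
    labels = []
    for start in range(0, len(X), chunk_size):
        probs = predict_fn(X[start:start + chunk_size])
        labels.append(np.argmax(probs, axis=1))
    return np.concatenate(labels)

def toy_predict(batch):
    """Stand-in for model.predict: class-1 probability rises with the row mean."""
    p1 = 1 / (1 + np.exp(-batch.mean(axis=1)))
    return np.column_stack([1 - p1, p1])

# Toy feature matrix (made up for illustration)
X_toy = np.array([
    [-2.0, -1.0],
    [3.0, 1.0],
    [-0.5, -0.5],
    [0.2, 0.4],
    [1.0, -3.0],
])

y_toy = predict_in_chunks(toy_predict, X_toy, chunk_size=2)
print(y_toy.tolist())
```

Keras already batches internally inside `model.predict`, so chunking does not by itself speed anything up. Its benefit is that each chunk's labels can be written to disk before moving on, so an interruption does not lose all progress.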