# __2.6 Prediction analysis__

Main questions:
- What is the overall performance of model?
- What features are important for predictions?
- What are the reasons for mispredictions?
  - Are there any patterns among predicted instances?
  - E.g., publication date, journal.

## ___Setup___

### Import

In [1]:
import json
import pandas as pd
from pathlib import Path

### Configurations

In [2]:
work_dir     = Path.home() / 'projects/plant_sci_hist/2_text_classify'
corpus_combo = work_dir / "corpus_combo"
train_pred   = work_dir / "corpus_train_pred"
valid_pred   = work_dir / "corpus_valid_pred"
test_pred    = work_dir / "corpus_test_pred"

### Consolidate predictions with original dataframe

In [3]:
def json_to_df(json_file):
  with json_file.open("r+") as f:
      json_loaded = json.load(f)
  df = pd.read_json(json_loaded)
  
  return df

In [4]:
# Get the original corpus
corpus_ori = json_to_df(corpus_combo)
corpus_ori.shape

(86646, 9)

In [5]:
# Load the prediction TSV files as dataframes
train_pred_df = pd.read_csv(train_pred, sep='\t', index_col=0)
valid_pred_df = pd.read_csv(valid_pred, sep='\t', index_col=0)
test_pred_df  = pd.read_csv(test_pred , sep='\t', index_col=0)

In [6]:
train_pred_df["subset"] = "train"
valid_pred_df["subset"] = "valid"
test_pred_df["subset"]  = "test"

In [7]:
train_pred_df.sample(3)

Unnamed: 0,y,y_pred,y_prob,X,subset
695692,0,0,0.000553,improving induction therapy multiple myeloma s...,train
1248319,0,1,0.96695,"characterization two arabidopsis lgulono1,4lac...",train
318231,0,0,0.000422,analysis gastrointestinal amyloidosis 78 patie...,train


In [8]:
# Make sure the indices between two dataframes are consistent
print(train_pred_df.loc[695692].X[:100])
print(corpus_ori.loc[695692].txt[:100])

improving induction therapy multiple myeloma significant improvement induction therapy multiple myel
Improving induction therapy in multiple myeloma.. Significant improvements in induction therapy for 


In [9]:
# Make sure the indices between two dataframes are consistent
print(train_pred_df.loc[1248319].X[:100])
print(corpus_ori.loc[1248319].txt[:100])

characterization two arabidopsis lgulono1,4lactone oxidases, atgullo3 atgullo5, involved ascorbate b
Characterization of Two Arabidopsis L-Gulono-1,4-lactone Oxidases, AtGulLO3 and AtGulLO5, Involved i


In [10]:
# Concatenate prediction dataframes
pred_df = pd.concat([train_pred_df, valid_pred_df, test_pred_df])
pred_df.shape

(86646, 5)

In [12]:
pred_df.head()

Unnamed: 0,y,y_pred,y_prob,X,subset
516651,1,1,0.98218,vivo vitro inhibition catalase leaf nicotiana ...,train
521301,1,1,0.993828,pathway glucose regulation monosaccharide tran...,train
65516,0,0,0.001417,feasibility home treatment diarrhoea packaged ...,train
277058,1,1,0.990589,modulation phosphatidylcholine biosynthesis ce...,train
753225,1,1,0.921157,120yr period dr beals seed viability experimen...,train


In [17]:
# Concatenate prediction dataframe and the original dataframe
combo_pred = pd.concat([corpus_ori, pred_df], axis=1)
combo_pred.head(2)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,label,txt_clean,y,y_pred,y_prob,X,subset
600447,18467466,2008-05-10,Plant physiology,The Arabidopsis halophytic relative Thellungie...,A comprehensive knowledge of mechanisms regula...,plant,The Arabidopsis halophytic relative Thellungie...,1,arabidopsis halophytic relative thellungiella ...,1,1,0.993899,arabidopsis halophytic relative thellungiella ...,train
583302,18065557,2007-12-11,Plant physiology,An Arabidopsis purple acid phosphatase with ph...,Ascorbate (AsA) is the most abundant antioxida...,plant,An Arabidopsis purple acid phosphatase with ph...,1,arabidopsis purple acid phosphatase phytase ac...,1,1,0.995514,arabidopsis purple acid phosphatase phytase ac...,train


In [16]:
# Check the label and y columns again to make sure they are exactly the same
combo_pred['label'].equals(combo_pred['y'])

True

In [19]:
# Drop the uncessary columns
combo_pred = combo_pred.drop(['txt', 'txt_clean', 'y', 'X'], axis=1)
combo_pred.head(2)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,label,y_pred,y_prob,subset
600447,18467466,2008-05-10,Plant physiology,The Arabidopsis halophytic relative Thellungie...,A comprehensive knowledge of mechanisms regula...,plant,1,1,0.993899,train
583302,18065557,2007-12-11,Plant physiology,An Arabidopsis purple acid phosphatase with ph...,Ascorbate (AsA) is the most abundant antioxida...,plant,1,1,0.995514,train


## ___Mis-predictions___

### Identify mis-predicted entries

In [23]:
val_l = combo_pred['label'].values
val_y = combo_pred['y'].values

In [24]:
correct = [val_l[i] == val_y[i] for i in range(len(val_l)) ]

In [42]:
combo_pred_b[~combo_pred_b['correct']]

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,label,y_pred,y_prob,subset,correct


In [41]:
combo_pred_b.loc[600447]

PMID                                                      18467466
Date                                                    2008-05-10
Journal                                           Plant physiology
Title            The Arabidopsis halophytic relative Thellungie...
Abstract         A comprehensive knowledge of mechanisms regula...
QualifiedName                                                plant
label                                                            1
y_pred                                                           1
y_prob                                                    0.993899
subset                                                       train
correct                                                       True
Name: 600447, dtype: object