# __2.6 Prediction analysis__

Main questions:
- What is the overall performance of the model?
- What features are important for predictions?
- What are the reasons for mispredictions?
  - Are there any patterns among mispredicted instances?
  - E.g., publication date, journal.

## ___Setup___

### Import

In [1]:
import json
import pandas as pd
from pathlib import Path

### Configurations

In [2]:
work_dir     = Path.home() / 'projects/plant_sci_hist/2_text_classify'
corpus_combo = work_dir / "corpus_combo"
train_pred   = work_dir / "corpus_train_pred"
valid_pred   = work_dir / "corpus_valid_pred"
test_pred    = work_dir / "corpus_test_pred"

### Consolidate predictions with original dataframe

In [3]:
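from io import StringIO

def json_to_df(json_file):
  # The file is expected to hold a JSON-encoded string wrapping the dataframe
  # JSON, so decode the outer layer first, then parse the inner one.
  with json_file.open("r") as f:
    json_loaded = json.load(f)
  # Wrap in StringIO: recent pandas deprecates literal JSON strings here.
  df = pd.read_json(StringIO(json_loaded))

  return df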
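
The two-step read in `json_to_df` only works if the file was written the same way, as a JSON string nested inside another JSON string. The sketch below illustrates that round trip on a toy dataframe. The names `toy`, `toy_file`, `inner`, and `restored` and the toy values are hypothetical and are not part of the project data.

```python
import json
import tempfile
from io import StringIO
from pathlib import Path

import pandas as pd

# Toy frame standing in for the corpus (hypothetical values)
toy = pd.DataFrame({"txt": ["a b", "c d"], "label": [1, 0]}, index=[10, 20])

with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy_corpus"

    # Write: dataframe -> JSON string -> JSON-encoded string on disk
    with toy_file.open("w") as f:
        json.dump(toy.to_json(), f)

    # Read: undo the outer encoding first, which yields a plain string ...
    with toy_file.open("r") as f:
        inner = json.load(f)

    # ... then parse the inner JSON into a dataframe
    restored = pd.read_json(StringIO(inner))

print(type(inner).__name__)
print(restored)
```

If a file holds plain dataframe JSON instead, `json.load` returns a dict rather than a string, and the second step would need to change accordingly.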

In [4]:
# Get the original corpus
corpus_ori = json_to_df(corpus_combo)
corpus_ori.shape

(86646, 9)

In [5]:
# Load the prediction TSV files as dataframes
train_pred_df = pd.read_csv(train_pred, sep='\t', index_col=0)
valid_pred_df = pd.read_csv(valid_pred, sep='\t', index_col=0)
test_pred_df  = pd.read_csv(test_pred , sep='\t', index_col=0)

In [6]:
train_pred_df["subset"] = "train"
valid_pred_df["subset"] = "valid"
test_pred_df["subset"]  = "test"

In [7]:
train_pred_df.sample(3)

Unnamed: 0,y,y_pred,y_prob,X,subset
1087555,0,0,0.000225,measure aggression victimization portuguese ad...,train
263262,0,0,0.003348,positive listsnegative list viewpoint clinical...,train
458896,1,1,0.966405,phosphorylation sphingoid longchain base arabi...,train


In [8]:
# Make sure the indices between two dataframes are consistent
print(train_pred_df.loc[695692].X[:100])
print(corpus_ori.loc[695692].txt[:100])

improving induction therapy multiple myeloma significant improvement induction therapy multiple myel
Improving induction therapy in multiple myeloma.. Significant improvements in induction therapy for 


In [9]:
# Spot-check a second index to confirm the two dataframes are consistent
print(train_pred_df.loc[1248319].X[:100])
print(corpus_ori.loc[1248319].txt[:100])

characterization two arabidopsis lgulono1,4lactone oxidases, atgullo3 atgullo5, involved ascorbate b
Characterization of Two Arabidopsis L-Gulono-1,4-lactone Oxidases, AtGulLO3 and AtGulLO5, Involved i


In [10]:
# Concatenate prediction dataframes
pred_df = pd.concat([train_pred_df, valid_pred_df, test_pred_df])
pred_df.shape

(86646, 5)
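
The spot checks above compare two individual rows. Because the later column-wise concatenation aligns rows on the index, it may be worth checking the whole index as well. The sketch below is one possible way to do that. The helper `check_index_alignment` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_index_alignment(left, right):
    """Return basic index-alignment diagnostics for two dataframes."""
    return {
        "left_unique": left.index.is_unique,
        "right_unique": right.index.is_unique,
        "only_in_left": len(left.index.difference(right.index)),
        "only_in_right": len(right.index.difference(left.index)),
    }

# Toy frames with the same index labels in a different order (hypothetical)
corpus_toy = pd.DataFrame({"label": [1, 0, 1]}, index=[5, 7, 9])
pred_toy = pd.DataFrame({"y": [1, 1, 0]}, index=[9, 5, 7])

report = check_index_alignment(corpus_toy, pred_toy)
print(report)

# Column-wise concat matches rows by index label, not by position
aligned = pd.concat([corpus_toy, pred_toy], axis=1)
print(aligned)
```

With the notebook's data, the call would be `check_index_alignment(corpus_ori, pred_df)`. Non-zero counts or a non-unique index would mean that the combined dataframe contains missing values or ambiguous rows.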

In [11]:
pred_df.head()

Unnamed: 0,y,y_pred,y_prob,X,subset
516651,1,1,0.98218,vivo vitro inhibition catalase leaf nicotiana ...,train
521301,1,1,0.993828,pathway glucose regulation monosaccharide tran...,train
65516,0,0,0.001417,feasibility home treatment diarrhoea packaged ...,train
277058,1,1,0.990589,modulation phosphatidylcholine biosynthesis ce...,train
753225,1,1,0.921157,120yr period dr beals seed viability experimen...,train


In [24]:
# Join the original and prediction dataframes column-wise (rows are aligned on the index)
combo_pred = pd.concat([corpus_ori, pred_df], axis=1)
combo_pred.head(2)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,txt,label,txt_clean,y,y_pred,y_prob,X,subset
600447,18467466,2008-05-10,Plant physiology,The Arabidopsis halophytic relative Thellungie...,A comprehensive knowledge of mechanisms regula...,plant,The Arabidopsis halophytic relative Thellungie...,1,arabidopsis halophytic relative thellungiella ...,1,1,0.993899,arabidopsis halophytic relative thellungiella ...,train
583302,18065557,2007-12-11,Plant physiology,An Arabidopsis purple acid phosphatase with ph...,Ascorbate (AsA) is the most abundant antioxida...,plant,An Arabidopsis purple acid phosphatase with ph...,1,arabidopsis purple acid phosphatase phytase ac...,1,1,0.995514,arabidopsis purple acid phosphatase phytase ac...,train


In [25]:
# Check the label and y columns again to make sure they are exactly the same
combo_pred['label'].equals(combo_pred['y'])

True

In [26]:
# Drop the unnecessary columns
combo_pred = combo_pred.drop(['txt', 'txt_clean', 'y', 'X'], axis=1)
combo_pred.head(2)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,label,y_pred,y_prob,subset
600447,18467466,2008-05-10,Plant physiology,The Arabidopsis halophytic relative Thellungie...,A comprehensive knowledge of mechanisms regula...,plant,1,1,0.993899,train
583302,18065557,2007-12-11,Plant physiology,An Arabidopsis purple acid phosphatase with ph...,Ascorbate (AsA) is the most abundant antioxida...,plant,1,1,0.995514,train


In [27]:
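# Save the consolidated dataframe; to_json() returns a string, so json.dump
# writes it as a JSON-encoded string that json_to_df can read back
combo_pred_json = combo_pred.to_json()
combo_pred_json_file = work_dir / 'corpus_combo_pred.json'
with combo_pred_json_file.open("w") as f:
  json.dump(combo_pred_json, f)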

## ___Mis-predictions___

### Identify mis-predicted entries

In [28]:
val_l = combo_pred['label'].values
val_y = combo_pred['y_pred'].values

In [29]:
# Flag rows where the predicted label differs from the true label
incorrect = val_l != val_y

In [30]:
combo_pred['incorrect'] = incorrect
combo_pred.shape

(86646, 11)

In [31]:
combo_pred_wrong = combo_pred[incorrect]
combo_pred_wrong.shape

(2884, 11)
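
With 2,884 mispredictions out of 86,646 entries, the overall error rate is about 3.3%. To see whether errors concentrate in one subset or in one direction (false positives versus false negatives), a breakdown like the following could be used. This is a minimal sketch on toy data. The helper `error_summary` and the toy values are hypothetical, and only the column names `label`, `y_pred`, and `subset` come from the notebook.

```python
import pandas as pd

def error_summary(df):
    """Per-subset number of rows, number of errors, and error rate."""
    wrong = df["label"] != df["y_pred"]
    summary = wrong.groupby(df["subset"]).agg(n="count", n_wrong="sum")
    summary["error_rate"] = summary["n_wrong"] / summary["n"]
    return summary

# Toy predictions (hypothetical values)
toy = pd.DataFrame({
    "label":  [1, 1, 0, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0],
    "subset": ["train", "train", "train", "valid", "valid", "test"],
})

summary = error_summary(toy)
print(summary)

# Confusion table: rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy["label"], toy["y_pred"])
print(confusion)
```

On the real data, `error_summary(combo_pred)` and `pd.crosstab(combo_pred["label"], combo_pred["y_pred"])` would give the same breakdown for the full corpus.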

In [32]:
combo_pred_wrong.sample(10)

Unnamed: 0,PMID,Date,Journal,Title,Abstract,QualifiedName,label,y_pred,y_prob,subset,incorrect
5116,173636,1975-01-01,Folia histochemica et cytochemica,Oxidative enzymes in the development of Fascio...,The object of the study was the investigation ...,hepatica,0,1,0.822273,valid,True
14963,605710,1977-01-01,Acta biologica Academiae Scientiarum Hungaricae,Changes in chromosome complement in long-term ...,A prolonged callus culture from pea (Pisum sat...,pea,0,1,0.706658,train,True
881237,24018323,2013-09-11,Advances in protein chemistry and structural b...,Structure-function relationship of the plant p...,"LHCII, the largest plant photosynthetic pigmen...",plant,0,1,0.937996,train,True
1177809,28943973,2017-09-26,Journal of genomics,Permanent Draft Genome sequence for Frankia sp...,Frankia sp. strain CcI49 was isolated from Cas...,plants,0,1,0.842421,test,True
63001,2154301,1990-01-01,Cell differentiation and development : the off...,Is ubiquitin involved in the dedifferentiation...,Transformation of a mesophyll cell into a viab...,plant,0,1,0.979742,train,True
1132663,28286876,2017-03-14,Current opinion in toxicology,Diversity as Opportunity: Insights from 600 Mi...,The aryl hydrocarbon receptor (AHR) was for ma...,anemone,0,1,0.705611,valid,True
717363,20948627,2009-01-01,F1000 biology reports,The dark side of clock-controlled flowering.,Perception of seasonal changes in day length a...,plants,0,1,0.604388,train,True
1086049,27398266,2014-12-01,Journal of geophysical research. Biogeosciences,Models of fluorescence and photosynthesis for ...,We have extended a conventional photosynthesis...,plants,0,1,0.98449,train,True
1042548,26632529,2015-12-04,Phytochemistry,"Differentiation between two ""fang ji"" herbal m...","Stephania tetrandra (""hang fang ji"") and Arist...",plant,1,0,0.191211,valid,True
919443,24504833,1969-03-01,Planta,Second positive phototropic response patterns ...,"1.During second positive irradiation, bending ...",oat,1,0,0.458158,valid,True
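
A random sample hints at possible patterns but cannot confirm them. To address the question of whether mispredictions cluster by journal or publication date, the error rate can be aggregated per group. The sketch below assumes a dataframe with `Journal`, `Date`, and `incorrect` columns, as in `combo_pred`. The helper `error_rate_by`, the `min_count` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def error_rate_by(df, key, min_count=1):
    """Error rate per group, keeping groups with at least min_count rows."""
    grouped = df.groupby(key)["incorrect"].agg(n="count", error_rate="mean")
    kept = grouped[grouped["n"] >= min_count]
    return kept.sort_values("error_rate", ascending=False)

# Toy records (hypothetical values)
toy = pd.DataFrame({
    "Journal": ["Journal A", "Journal A", "Journal B", "Journal C"],
    "Date": ["1969-03-01", "2015-12-04", "2015-06-01", "2017-09-26"],
    "incorrect": [True, False, True, False],
})

# Group by journal name
by_journal = error_rate_by(toy, "Journal")
print(by_journal)

# Group by publication year derived from the Date column
year = pd.to_datetime(toy["Date"]).dt.year.rename("year")
by_year = error_rate_by(toy, year)
print(by_year)
```

Setting `min_count` to a larger value helps avoid over-reading journals or years that have only a handful of records.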
