# About this notebook:

- We build our baseline model using `OneVsRestClassifier(BernoulliNB())`.
- The aim of the model: given an article abstract, assign it NLM Primary Disease term labels.
- The target is a samples-average precision and F1-score above 0.70.

[Part 1: Preparing our X and y](#ID_1)<br>
[Part 2: Baseline Model](#ID_2)<br>
[Part 3: Model Evaluation](#ID_3)<br>

# Library Imports & Functions Creation <a class="anchor" id="ID_0"></a>

In [1]:
import numpy as np 
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


#Visualisation:
import seaborn               as sns
import matplotlib.pyplot     as plt
sns.set_theme(style="whitegrid")

from tqdm import tqdm
tqdm.pandas()

#Showing shape, duplicates, dtypes, missing values
def df_summary(df):
    """Print shape, duplicate count, dtype counts and columns with missing values."""
    # df.shape is (n_rows, n_columns)
    print(f"Shape (rows, cols): {df.shape}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print('---'*20)
    print(f'Number of each unique datatype:\n{df.dtypes.value_counts()}')
    print('---'*20)
    print("Columns with missing values:")
    # One row per column: null count and share of nulls
    isnull_df = pd.DataFrame(df.isnull().sum()).reset_index()
    isnull_df.columns = ['col','num_nulls']
    isnull_df['perc_null'] = ((isnull_df['num_nulls'])/(len(df))).round(2)
    print(isnull_df[isnull_df['num_nulls']>0])

In [2]:
#Preprocessing:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.model_selection import IterativeStratification


#Modelling
from sklearn.naive_bayes import BernoulliNB
from sklearn.multiclass import OneVsRestClassifier


#Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.metrics import make_scorer
from sklearn.metrics import recall_score #sensitivity
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report #precision+recall+f1-score

from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline

In [3]:
df = pd.read_csv("Modelling_df.csv")

In [4]:
df.columns

Index(['ID', 'title', 'Pub_Date', 'abstract', 'Article_Given_MeSH',
       'Pri_diseases_name'],
      dtype='object')

In [5]:
df = df.loc[:,['ID','title','abstract','Pri_diseases_name']]

In [6]:
df_summary(df)

Shape (rows, cols): (240836, 4)
Number of duplicates: 0
------------------------------------------------------------
Number of each unique datatype:
object    3
int64     1
dtype: int64
------------------------------------------------------------
Columns with missing values:
Empty DataFrame
Columns: [col, num_nulls, perc_null]
Index: []


In [7]:
df.head()

| | ID | title | abstract | Pri_diseases_name |
|---|---|---|---|---|
| 0 | 21210353 | human leukocyte antigen-g (hla-g) as a marker ... | human leukocyt antigeng hlag nonclass hlacla... | {'neoplasms'} |
| 1 | 21265258 | head and neck follicular dendritic cell sarcom... | current less 50 case head neck follicular den... | {'neoplasms', 'pathological conditions, signs ... |
| 2 | 21245633 | effectiveness of repeated intragastric balloon... | 19yearold japanes male bmi 554 kgm 2 also li... | {'nutritional and metabolic diseases', 'pathol... |
| 3 | 21194024 | golden retriever muscular dystrophy (grmd): de... | studi canin model duchenn muscular dystrophi ... | {'animal diseases', 'pathological conditions, ... |
| 4 | 21220749 | dichotomous regulation of gvhd through bidirec... | b lymphocyt attenu btla coinhibitori recepto... | {'neoplasms', 'pathological conditions, signs ... |


In [8]:
df.dtypes

ID                    int64
title                object
abstract             object
Pri_diseases_name    object
dtype: object

# Part 1: Preparing our X and y <a class="anchor" id="ID_1"></a>

## X: Abstract

In [9]:
X = df['abstract']

## y: Labels

In [10]:
# Remove the surrounding braces of the set-literal string
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.strip('{}')

# Two label names contain ", " themselves. Replace it with "_" inside those
# names so that the later split on ", " does not break them apart.
str_1 = "congenital, hereditary, and neonatal diseases and abnormalities"
sub_1 = str_1.replace(", ","_")
str_2 = "pathological conditions, signs and symptoms"
sub_2 = str_2.replace(", ","_")

# regex=False: treat the label names as literal text, not as patterns
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace(str_1,sub_1,regex=False)
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace(str_2,sub_2,regex=False)
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace("'","",regex=False)
# Turn each cell into a list of label names
df['Pri_diseases_name']  = df['Pri_diseases_name'].str.split(", ")
df

| | ID | title | abstract | Pri_diseases_name |
|---|---|---|---|---|
| 0 | 21210353 | human leukocyte antigen-g (hla-g) as a marker ... | human leukocyt antigeng hlag nonclass hlacla... | [neoplasms] |
| 1 | 21265258 | head and neck follicular dendritic cell sarcom... | current less 50 case head neck follicular den... | [neoplasms, pathological conditions_signs and ... |
| 2 | 21245633 | effectiveness of repeated intragastric balloon... | 19yearold japanes male bmi 554 kgm 2 also li... | [nutritional and metabolic diseases, pathologi... |
| 3 | 21194024 | golden retriever muscular dystrophy (grmd): de... | studi canin model duchenn muscular dystrophi ... | [animal diseases, pathological conditions_sign... |
| 4 | 21220749 | dichotomous regulation of gvhd through bidirec... | b lymphocyt attenu btla coinhibitori recepto... | [neoplasms, pathological conditions_signs and ... |
| ... | ... | ... | ... | ... |
| 240831 | 26709456 | reactive oxygen species production by human de... | tuberculosi remain singl largest infecti disea... | [respiratory tract diseases, infections] |
| 240832 | 26675461 | evaluating the use of commercial west nile vir... | evalu util 2 type commerci avail antigen posit... | [infections, pathological conditions_signs and... |
| 240833 | 26709605 | efficacy of protease inhibitor monotherapy vs.... | aim analysi review evid updat metaanalysi eval... | [immune system diseases, infections] |
| 240834 | 26662151 | the occurrence of chronic lymphocytic leukemia... | occurr chronic myeloid leukemia cml chronic ... | [neoplasms, cardiovascular diseases] |
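
The string replacements in the cell above depend on knowing in advance which label names contain a comma. As a possible alternative, assuming every cell holds a Python set literal exactly as shown in the earlier `df.head()` output, `ast.literal_eval` could parse each cell directly. The helper name `parse_labels` and the two sample strings are illustrative only and not part of the original notebook.

```python
import ast

def parse_labels(cell):
    """Parse a set-literal string such as "{'a', 'b'}" into a sorted list of labels."""
    return sorted(ast.literal_eval(cell))

# Sample cells in the same shape as the raw Pri_diseases_name column (illustrative)
raw = [
    "{'neoplasms'}",
    "{'neoplasms', 'pathological conditions, signs and symptoms'}",
]

parsed = [parse_labels(cell) for cell in raw]
for labels in parsed:
    print(labels)
```

On the real column this would be applied with `df['Pri_diseases_name'].apply(parse_labels)`. Note that commas inside label names stay intact with this approach, so the label names would differ from the underscore variants that appear in the classification reports.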


In [None]:
y = df['Pri_diseases_name']

# Part 2: Baseline Model <a class="anchor" id="ID_2"></a>

#### Binarize y

In [12]:
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y)
labels = list(mlb.classes_)
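
To make the binarization step concrete, here is a plain-Python sketch of the kind of indicator matrix `MultiLabelBinarizer` produces: one column per distinct label in sorted order, one row per article. The three label lists are made up for illustration.

```python
# Toy label lists (hypothetical, not taken from the dataset)
y_lists = [
    ["neoplasms"],
    ["infections", "respiratory tract diseases"],
    ["infections", "neoplasms"],
]

# Column order: all distinct labels, sorted alphabetically
classes = sorted({label for labels in y_lists for label in labels})

# One row per article, with 1 where the article carries the label
y_bin = [[int(c in labels) for c in classes] for labels in y_lists]

print(classes)
for row in y_bin:
    print(row)
```

Each column of this matrix is the binary target that one of the per-label classifiers inside `OneVsRestClassifier` is trained on.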

#### Train-test-split

In [14]:
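%%time
train_size = 0.75
stratifier = IterativeStratification(n_splits=2, order=2, sample_distribution_per_fold=[train_size, 1.0-train_size])

# split() yields one (train, test) pair per fold, with that fold as the test set.
# Only the pair from the last iteration is kept: the 25% fold becomes the
# test set and the remaining 75% the training set.
for train, test in stratifier.split(X, y):
    # train/test are positional indices, hence .iloc on the pandas Series
    X_train, y_train = X.iloc[train], y[train]
    X_test, y_test = X.iloc[test], y[test]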

CPU times: total: 8min 58s
Wall time: 9min 9s


#### Vectorise X

In [15]:
tvec = TfidfVectorizer()

X_train = tvec.fit_transform(X_train)
X_test = tvec.transform(X_test)

#### Model fitting

In [16]:
# Train the Naive Bayes classifier using the Binary Relevance method
clf = OneVsRestClassifier(BernoulliNB())
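
One detail worth keeping in mind: with its default `binarize=0.0`, `BernoulliNB` thresholds every feature, so any positive TF-IDF weight is treated simply as "term present". The small sketch below, on made-up documents and a made-up binary label (`docs` and `is_neoplasm` are illustrative), compares TF-IDF input against binary term indicators. Under that default the two sets of predicted probabilities should coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels (hypothetical, not from the dataset)
docs = [
    "tumor growth in lung tissue",
    "viral infection of the lung",
    "tumor marker expression",
    "bacterial infection and fever",
]
is_neoplasm = [1, 0, 1, 0]

# Same tokenisation and vocabulary, different feature values
X_tfidf = TfidfVectorizer().fit_transform(docs)
X_binary = CountVectorizer(binary=True).fit_transform(docs)

proba_tfidf = BernoulliNB().fit(X_tfidf, is_neoplasm).predict_proba(X_tfidf)
proba_binary = BernoulliNB().fit(X_binary, is_neoplasm).predict_proba(X_binary)

# True when the TF-IDF weights made no difference to the model
same = np.allclose(proba_tfidf, proba_binary)
print(same)
```

If the TF-IDF weights are meant to carry information, a classifier that uses feature magnitudes (for example `MultinomialNB`) would be needed.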

In [17]:
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [18]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.49      0.69      0.57     19576
                                      cardiovascular diseases       0.65      0.57      0.61     27720
                                 chemically-induced disorders       0.29      0.05      0.09      6780
congenital_hereditary_and neonatal diseases and abnormalities       0.21      0.03      0.05      5702
                                    digestive system diseases       0.72      0.23      0.35     12407
                            disorders of environmental origin       1.00      0.00      0.00         2
                                    endocrine system diseases       0.69      0.30      0.42     12674
                                                 eye diseases       0.08      0.01      0.02      3064
                                 hemic and lymphatic diseases       0.14

In [19]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.47      0.63      0.54      6513
                                      cardiovascular diseases       0.62      0.44      0.52      9240
                                 chemically-induced disorders       0.25      0.01      0.02      2260
congenital_hereditary_and neonatal diseases and abnormalities       0.17      0.01      0.01      1900
                                    digestive system diseases       0.72      0.08      0.15      4136
                            disorders of environmental origin       1.00      0.00      0.00         1
                                    endocrine system diseases       0.69      0.16      0.26      4225
                                                 eye diseases       0.00      0.00      0.00      1021
                                 hemic and lymphatic diseases       0.09
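
The support columns of the two reports above allow a quick check that the iterative stratification kept roughly 75% of each label in the training set. The sketch below only reuses the support counts visible in the reports; the variable names are illustrative.

```python
# (train support, test support) copied from the two classification reports
support = {
    "animal diseases": (19576, 6513),
    "cardiovascular diseases": (27720, 9240),
    "chemically-induced disorders": (6780, 2260),
    "digestive system diseases": (12407, 4136),
    "endocrine system diseases": (12674, 4225),
    "eye diseases": (3064, 1021),
    "disorders of environmental origin": (2, 1),
}

# Share of each label's articles that ended up in the training set
train_share = {
    label: n_train / (n_train + n_test)
    for label, (n_train, n_test) in support.items()
}

for label, share in train_share.items():
    print(f"{label:<35} {share:.3f}")
```

The well-populated labels land very close to 0.75, while *disorders of environmental origin*, with only three articles in total, cannot be split 75/25 at all.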

# Part 3: Model Evaluation <a class="anchor" id="ID_3"></a>

The aim of our model: given an article abstract, assign it NLM Primary Disease term labels.
The target is a samples-average precision and F1-score above 0.70.
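
For reference, the "samples avg" row of `classification_report` averages each metric over articles rather than over labels: precision, recall and F1 are computed per article from its true and predicted label sets, then averaged. Below is a minimal plain-Python sketch on made-up indicator rows. The function name `samples_avg` is illustrative, and its `zero_division` argument is only an analogy to the `zero_division=1` used in the report calls of this notebook, not a guaranteed match for every edge case.

```python
def samples_avg(y_true, y_pred, zero_division=1.0):
    """Average per-sample precision, recall and F1 over all samples."""
    precisions, recalls, f1s = [], [], []
    for true_row, pred_row in zip(y_true, y_pred):
        tp = sum(t * p for t, p in zip(true_row, pred_row))  # labels hit
        n_true, n_pred = sum(true_row), sum(pred_row)
        precisions.append(tp / n_pred if n_pred else zero_division)
        recalls.append(tp / n_true if n_true else zero_division)
        f1s.append(2 * tp / (n_true + n_pred) if (n_true + n_pred) else zero_division)
    n = len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy indicator rows: 3 articles, 3 labels (hypothetical)
y_true_toy = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred_toy = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

precision, recall, f1 = samples_avg(y_true_toy, y_pred_toy)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In scikit-learn the corresponding values are available through `precision_score(..., average="samples")` and `f1_score(..., average="samples")`, both of which are imported at the top of this notebook.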

**Model performance**:<br>
- Some labels performed drastically worse than others, mostly those with distinctly fewer articles (e.g. *disorders of environmental origin*, *occupational diseases*).
- Importantly, the baseline model did not meet our target (samples-average precision and F1-score > 0.70).
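
A quick way to see which labels are under-represented is to count articles per label before binarizing. The sketch below uses made-up label lists, and the cut-off `min_articles` is a hypothetical value chosen only for the example.

```python
from collections import Counter

# Toy label lists (hypothetical, not taken from the dataset)
y_lists = [
    ["neoplasms"],
    ["neoplasms", "infections"],
    ["infections"],
    ["neoplasms", "occupational diseases"],
]

# Number of articles carrying each label
label_counts = Counter(label for labels in y_lists for label in labels)

min_articles = 2  # hypothetical cut-off for "too few articles"
rare_labels = sorted(label for label, n in label_counts.items() if n < min_articles)

print(label_counts.most_common())
print(rare_labels)
```

Labels flagged this way are natural candidates for being dropped or merged into a related label.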

**Plans for model improvement**:
1. Drop some labels and merge some similar labels
1. Explore including article title into the X features
1. Explore different classifiers (MultinomialNB, Logistic Regression)
1. Hyperparameter tuning through GridSearchCV
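
As a rough sketch of how the last two plan items might be combined, the vectoriser and classifier can be wrapped in a `Pipeline` and tuned with `GridSearchCV`. The step names (`tvec`, `clf`) and all grid values below are placeholders for illustration, not tuned recommendations, and the final `fit` call is left commented out because it needs the raw abstracts rather than the already vectorised `X_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Vectoriser and one-vs-rest classifier in a single estimator
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(MultinomialNB())),
])

# Hypothetical search space; keys follow the <step>__<param> convention
param_grid = {
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "tvec__min_df": [1, 5],
    "clf__estimator__alpha": [0.1, 1.0],
}

# "f1_samples" scores each candidate by samples-average F1
search = GridSearchCV(pipe, param_grid, scoring="f1_samples", cv=3)

# search.fit(X_train_text, y_train)  # X_train_text: raw abstracts (assumed name)
```

With a multilabel indicator target, an integer `cv` gives plain K-fold splits rather than iterative stratification, so rare labels may be missing from some folds.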