# About this notebook:

- We aim to improve on the performance of the baseline model:
    - We will explore dropping some labels and merging similar labels
    - We will explore including the article title in the X features
    - We will explore different classifiers (MultinomialNB, Logistic Regression)
- The aim of our model is to assign NLM Primary Disease term labels to an article, given its abstract.
- The target metric is a samples-averaged precision and f1-score > 0.70.

[Part 1: Merge & drop labels, including Title and Abstract as X](#ID_1)<br>
[Part 2: Models and Hyperparameter Tuning](#ID_2)<br>
[Part 3: Best Model Evaluation](#ID_3)<br>
[Part 4: Discussion & Conclusion](#ID_4)<br>
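
The target metric named above is averaged per sample (per article) rather than per label. As a minimal sketch of what that means, the hypothetical helper `samples_avg_scores` below computes precision and F1 for each article's predicted label set and then averages across articles. It is an illustration only, not the scikit-learn implementation, and it assumes that an empty prediction or empty truth set scores 0.

```python
def samples_avg_scores(y_true, y_pred):
    """Return (precision, f1), each averaged over samples.

    y_true and y_pred are sequences of label sets, one set per article.
    Assumption: a sample with no predicted (or no true) labels scores 0.
    """
    precisions, f1_scores = [], []
    for true, pred in zip(y_true, y_pred):
        true, pred = set(true), set(pred)
        hits = len(true & pred)
        precision = hits / len(pred) if pred else 0.0
        recall = hits / len(true) if true else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        f1_scores.append(f1)
    n = len(precisions)
    return sum(precisions) / n, sum(f1_scores) / n


# Two made-up articles, using label names from this notebook
y_true = [{"neoplasms"}, {"infections", "respiratory tract diseases"}]
y_pred = [{"neoplasms", "infections"}, {"infections"}]

precision, f1 = samples_avg_scores(y_true, y_pred)
print(round(precision, 3), round(f1, 3))  # 0.75 0.667
```

The first article gets precision 0.5 (one of two predicted labels is correct) and the second gets 1.0, so the samples average is 0.75.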

# Downloads & Functions

In [1]:
import numpy as np 
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


#Visualisation:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

from tqdm import tqdm
tqdm.pandas()

#Showing missing, duplicates, shape, dtypes
def df_summary(df):
    print(f"Shape(col,rows): {df.shape}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print('---'*20)
    print(f'Number of each unqiue datatypes:\n{df.dtypes.value_counts()}')
    print('---'*20)
    print("Columns with missing values:")
    isnull_df = pd.DataFrame(df.isnull().sum()).reset_index()
    isnull_df.columns = ['col','num_nulls']
    isnull_df['perc_null'] = ((isnull_df['num_nulls'])/(len(df))).round(2)
    print(isnull_df[isnull_df['num_nulls']>0])

In [2]:
#Preprocessing:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.model_selection import IterativeStratification


#Modelling
from sklearn.naive_bayes import BernoulliNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

#Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.metrics import make_scorer
from sklearn.metrics import recall_score #sensitivity
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report #precision+recall+f1-score

from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline

In [3]:
#Preprocessing
import re
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

from nltk.tokenize import word_tokenize

#remove stopwords
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [4]:
df = pd.read_csv("Modelling_df.csv")

In [5]:
df = df.loc[:,['ID','title','abstract','Pri_diseases_name']]

In [6]:
df_summary(df)

Shape(col,rows): (240836, 4)
Number of duplicates: 0
------------------------------------------------------------
Number of each unqiue datatypes:
object    3
int64     1
dtype: int64
------------------------------------------------------------
Columns with missing values:
Empty DataFrame
Columns: [col, num_nulls, perc_null]
Index: []


In [7]:
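# Convert Pri_diseases_name from its string form back to a list of NLM Pri Disease terms.
# Two label names contain ", " themselves, so it is replaced with "_" inside them first,
# otherwise splitting on ", " would break those labels apart.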
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.strip('{')
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.strip('}')

str_1 = "congenital, hereditary, and neonatal diseases and abnormalities"
sub_1 = str_1.replace(", ","_")
str_2 = "pathological conditions, signs and symptoms"
sub_2 = str_2.replace(", ","_")

df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace(str_1,sub_1)
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace(str_2,sub_2)
df['Pri_diseases_name'] = df['Pri_diseases_name'].str.replace("'","")
df['Pri_diseases_name']  = df['Pri_diseases_name'].str.split(", ")
df

Unnamed: 0,ID,title,abstract,Pri_diseases_name
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,human leukocyt antigeng hlag nonclass hlacla...,[neoplasms]
1,21265258,head and neck follicular dendritic cell sarcom...,current less 50 case head neck follicular den...,"[neoplasms, pathological conditions_signs and ..."
2,21245633,effectiveness of repeated intragastric balloon...,19yearold japanes male bmi 554 kgm 2 also li...,"[nutritional and metabolic diseases, pathologi..."
3,21194024,golden retriever muscular dystrophy (grmd): de...,studi canin model duchenn muscular dystrophi ...,"[animal diseases, pathological conditions_sign..."
4,21220749,dichotomous regulation of gvhd through bidirec...,b lymphocyt attenu btla coinhibitori recepto...,"[neoplasms, pathological conditions_signs and ..."
...,...,...,...,...
240831,26709456,reactive oxygen species production by human de...,tuberculosi remain singl largest infecti disea...,"[respiratory tract diseases, infections]"
240832,26675461,evaluating the use of commercial west nile vir...,evalu util 2 type commerci avail antigen posit...,"[infections, pathological conditions_signs and..."
240833,26709605,efficacy of protease inhibitor monotherapy vs....,aim analysi review evid updat metaanalysi eval...,"[immune system diseases, infections]"
240834,26662151,the occurrence of chronic lymphocytic leukemia...,occurr chronic myeloid leukemia cml chronic ...,"[neoplasms, cardiovascular diseases]"


# Part 1: Merging and dropping labels, and using Title as one of the X features <a class="anchor" id="ID_1"></a>

#### Merge and drop labels

In [8]:
%%time
# 1. remove disorders of environmental origin, and occupational disease
# 2. combine cardiovascular diseases & hemic and lymphatic diseases
# 3. combine 'nutritional and metabolic diseases' & 'endocrine system diseases'
# 4. combine 'wounds and injuries' & 'chemically-induced disorders' 
# 5. combine otorhinolaryngologic diseases & stomatognathic diseases

df_2 = df.copy()

remove_func = lambda lst: [name for name in lst if name != 'disorders of environmental origin']
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(remove_func)

remove_func = lambda lst: [name for name in lst if name != 'occupational diseases']
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(remove_func)

replace_func = lambda lst: [name.replace('cardiovascular diseases', 'cardiovascular, haem and lymphatic diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('hemic and lymphatic diseases', 'cardiovascular, haem and lymphatic diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('nutritional and metabolic diseases', 'nutritional, metabolic, endocrine diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('endocrine system diseases', 'nutritional, metabolic, endocrine diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('chemically-induced disorders', 'wounds,injuries, or chemically-induced disorders') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('wounds and injuries', 'wounds,injuries, or chemically-induced disorders') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('stomatognathic diseases', 'otorhinolaryngologic & stomatognathic diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

replace_func = lambda lst: [name.replace('otorhinolaryngologic diseases', 'otorhinolaryngologic & stomatognathic diseases') for name in lst]
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(replace_func)

remove_dup_func = lambda lst: list(set(lst))
df_2['Pri_diseases_name'] = df_2['Pri_diseases_name'].apply(remove_dup_func)
df_2

CPU times: total: 3.72 s
Wall time: 3.72 s


Unnamed: 0,ID,title,abstract,Pri_diseases_name
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,human leukocyt antigeng hlag nonclass hlacla...,[neoplasms]
1,21265258,head and neck follicular dendritic cell sarcom...,current less 50 case head neck follicular den...,"[pathological conditions_signs and symptoms, n..."
2,21245633,effectiveness of repeated intragastric balloon...,19yearold japanes male bmi 554 kgm 2 also li...,"[nutritional, metabolic, endocrine diseases, p..."
3,21194024,golden retriever muscular dystrophy (grmd): de...,studi canin model duchenn muscular dystrophi ...,"[animal diseases, pathological conditions_sign..."
4,21220749,dichotomous regulation of gvhd through bidirec...,b lymphocyt attenu btla coinhibitori recepto...,"[pathological conditions_signs and symptoms, n..."
...,...,...,...,...
240831,26709456,reactive oxygen species production by human de...,tuberculosi remain singl largest infecti disea...,"[respiratory tract diseases, infections]"
240832,26675461,evaluating the use of commercial west nile vir...,evalu util 2 type commerci avail antigen posit...,"[pathological conditions_signs and symptoms, i..."
240833,26709605,efficacy of protease inhibitor monotherapy vs....,aim analysi review evid updat metaanalysi eval...,"[immune system diseases, infections]"
240834,26662151,the occurrence of chronic lymphocytic leukemia...,occurr chronic myeloid leukemia cml chronic ...,"[cardiovascular, haem and lymphatic diseases, ..."
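
The merge-and-drop rules in the cell above can also be written as data. The sketch below is one possible compact form, not the code that produced the output shown: `DROP`, `MERGE`, and `relabel` are hypothetical names, lookups are exact matches on the whole label (the original uses substring `str.replace`), and the result is sorted for a deterministic order (the original uses `list(set(...))`).

```python
# Labels removed entirely
DROP = {"disorders of environmental origin", "occupational diseases"}

# Old label -> merged label (merged names copied from the cell above)
MERGE = {
    "cardiovascular diseases": "cardiovascular, haem and lymphatic diseases",
    "hemic and lymphatic diseases": "cardiovascular, haem and lymphatic diseases",
    "nutritional and metabolic diseases": "nutritional, metabolic, endocrine diseases",
    "endocrine system diseases": "nutritional, metabolic, endocrine diseases",
    "chemically-induced disorders": "wounds,injuries, or chemically-induced disorders",
    "wounds and injuries": "wounds,injuries, or chemically-induced disorders",
    "stomatognathic diseases": "otorhinolaryngologic & stomatognathic diseases",
    "otorhinolaryngologic diseases": "otorhinolaryngologic & stomatognathic diseases",
}


def relabel(labels):
    """Drop unwanted labels, map merged ones, and de-duplicate."""
    merged = {MERGE.get(name, name) for name in labels if name not in DROP}
    return sorted(merged)


example = relabel(
    ["cardiovascular diseases", "hemic and lymphatic diseases", "occupational diseases"]
)
print(example)  # ['cardiovascular, haem and lymphatic diseases']
```

With pandas this would be applied once, as `df_2['Pri_diseases_name'].apply(relabel)`, instead of ten separate `apply` passes.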


#### Investigating articles that no longer have any Pri Diseases labels

In [9]:
df_2["pri_labels_count"] = df_2["Pri_diseases_name"].apply(lambda x: len(x))
df_2

Unnamed: 0,ID,title,abstract,Pri_diseases_name,pri_labels_count
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,human leukocyt antigeng hlag nonclass hlacla...,[neoplasms],1
1,21265258,head and neck follicular dendritic cell sarcom...,current less 50 case head neck follicular den...,"[pathological conditions_signs and symptoms, n...",2
2,21245633,effectiveness of repeated intragastric balloon...,19yearold japanes male bmi 554 kgm 2 also li...,"[nutritional, metabolic, endocrine diseases, p...",2
3,21194024,golden retriever muscular dystrophy (grmd): de...,studi canin model duchenn muscular dystrophi ...,"[animal diseases, pathological conditions_sign...",2
4,21220749,dichotomous regulation of gvhd through bidirec...,b lymphocyt attenu btla coinhibitori recepto...,"[pathological conditions_signs and symptoms, n...",2
...,...,...,...,...,...
240831,26709456,reactive oxygen species production by human de...,tuberculosi remain singl largest infecti disea...,"[respiratory tract diseases, infections]",2
240832,26675461,evaluating the use of commercial west nile vir...,evalu util 2 type commerci avail antigen posit...,"[pathological conditions_signs and symptoms, i...",2
240833,26709605,efficacy of protease inhibitor monotherapy vs....,aim analysi review evid updat metaanalysi eval...,"[immune system diseases, infections]",2
240834,26662151,the occurrence of chronic lymphocytic leukemia...,occurr chronic myeloid leukemia cml chronic ...,"[cardiovascular, haem and lymphatic diseases, ...",2


In [10]:
_ = df_2.loc[df_2['pri_labels_count']==0,:]
print(f"{len(_)} articles now do not have NLM Pri Disease terms labels")

12 articles now do not have NLM Pri Disease terms labels


In [11]:
# remove rows for articles that no longer have any NLM Pri Disease terms labels
df_2 = df_2.loc[df_2['pri_labels_count']!=0,:]

In [12]:
df_2.reset_index(drop=True,inplace=True)

In [13]:
df_2.explode('Pri_diseases_name')['Pri_diseases_name'].unique()

array(['neoplasms', 'pathological conditions_signs and symptoms',
       'nutritional, metabolic, endocrine diseases', 'animal diseases',
       'musculoskeletal diseases',
       'congenital_hereditary_and neonatal diseases and abnormalities',
       'nervous system diseases',
       'cardiovascular, haem and lymphatic diseases',
       'skin and connective tissue diseases',
       'wounds,injuries, or chemically-induced disorders', 'infections',
       'digestive system diseases', 'immune system diseases',
       'respiratory tract diseases',
       'otorhinolaryngologic & stomatognathic diseases',
       'urogenital diseases', 'eye diseases'], dtype=object)

In [14]:
_ = df_2.explode('Pri_diseases_name')['Pri_diseases_name'].nunique()
print(f"Number of NLM Pri Diseases terms as labels: {_}")

Number of NLM Pri Diseases terms as labels: 17


In [15]:
df_2.shape

(240824, 5)

#### Train Test Split<br>
We first split the data using only 'ID' as X, as IterativeStratification can only use one column for X. The title and abstract are joined back onto each split afterwards.

In [16]:
X = df_2['ID']
y = df_2['Pri_diseases_name']

In [17]:
#binarize y
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y)
labels = list(mlb.classes_)
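
To make the binarization step concrete, here is a hand-rolled sketch of the indicator matrix that a multi-label binarizer produces: one column per label in sorted order, and a 1 wherever an article carries that label. The helper `binarize` is hypothetical and only mirrors the idea; the notebook itself uses `MultiLabelBinarizer`.

```python
def binarize(label_lists):
    """Return (classes, rows): sorted label names and a 0/1 indicator matrix."""
    classes = sorted({name for labels in label_lists for name in labels})
    column = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for name in labels:
            row[column[name]] = 1
        rows.append(row)
    return classes, rows


classes, rows = binarize([["neoplasms"], ["respiratory tract diseases", "infections"]])
print(classes)  # ['infections', 'neoplasms', 'respiratory tract diseases']
print(rows)     # [[0, 1, 0], [1, 0, 1]]
```

The column order is what `labels` holds in the cell above, which is why it can later be passed as `target_names` to `classification_report`.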

In [18]:
%%time
size=0.75
stratifier = IterativeStratification(n_splits=2, order=2, samples_distribution_per_fold=[size, 1.0-size])

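# split() yields one (train, test) pair per fold; each pass overwrites the variables,
# so the assignments from the final iteration are the ones kept
for train, test in stratifier.split(X, y):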
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]

CPU times: total: 9min 1s
Wall time: 9min 3s


In [19]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

#### Set X features as Title and Abstract

In [20]:
_d = df_2.loc[:,['ID', 'title', 'abstract']]
_d

Unnamed: 0,ID,title,abstract
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,human leukocyt antigeng hlag nonclass hlacla...
1,21265258,head and neck follicular dendritic cell sarcom...,current less 50 case head neck follicular den...
2,21245633,effectiveness of repeated intragastric balloon...,19yearold japanes male bmi 554 kgm 2 also li...
3,21194024,golden retriever muscular dystrophy (grmd): de...,studi canin model duchenn muscular dystrophi ...
4,21220749,dichotomous regulation of gvhd through bidirec...,b lymphocyt attenu btla coinhibitori recepto...
...,...,...,...
240819,26709456,reactive oxygen species production by human de...,tuberculosi remain singl largest infecti disea...
240820,26675461,evaluating the use of commercial west nile vir...,evalu util 2 type commerci avail antigen posit...
240821,26709605,efficacy of protease inhibitor monotherapy vs....,aim analysi review evid updat metaanalysi eval...
240822,26662151,the occurrence of chronic lymphocytic leukemia...,occurr chronic myeloid leukemia cml chronic ...


In [21]:
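def stem_sentences(sentence):
    """Tokenize a sentence and return it with every token Porter-stemmed."""
    tokens = word_tokenize(sentence)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)


def remove_spec_char(text):
    """Drop every character that is not a letter, digit, or space."""
    return re.sub('[^A-Za-z0-9 ]+', '', text)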

In [22]:
X_train = X_train.join(_d.set_index('ID'), on='ID')
X_train.drop(columns='ID',inplace=True)

X_test = X_test.join(_d.set_index('ID'), on='ID')
X_test.drop(columns='ID',inplace=True)

In [23]:
X_train.reset_index(drop=True,inplace=True)
X_train['title'] = X_train['title'].apply(stem_sentences)
X_train['title'] = X_train['title'].apply(remove_spec_char)

X_test.reset_index(drop=True,inplace=True)
X_test['title'] = X_test['title'].apply(stem_sentences)
X_test['title'] = X_test['title'].apply(remove_spec_char)

In [24]:
X_train['combine'] = X_train['title']+X_train['abstract']
X_test['combine'] = X_test['title']+X_test['abstract']
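
The `title + abstract` concatenation above inserts no separator. The displayed output suggests the stored abstracts begin with whitespace, in which case nothing is lost, but if an abstract does not, the last title token and the first abstract token fuse into one. The sketch below shows the effect on plain strings and a defensive variant; `combine` is a hypothetical helper and the two strings are abbreviated from the sample output.

```python
def combine(title, abstract):
    """Join title and abstract with exactly one space between them."""
    return f"{title.strip()} {abstract.strip()}"


title = "multimod imag of chronic tophac gout"
abstract = "diagnosi gout usual base clinic present"

fused = title + abstract          # no separator
joined = combine(title, abstract)  # explicit separator

print(fused.split()[5])     # goutdiagnosi
print(joined.split()[5:7])  # ['gout', 'diagnosi']
```

The pandas equivalent would be `X_train['title'].str.strip() + ' ' + X_train['abstract'].str.strip()`. Rerunning with it could change the vocabulary size reported later in the notebook.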

In [25]:
X_train.head()
X_test.head()

Unnamed: 0,title,abstract,combine
0,head and neck follicular dendrit cell sarcoma ...,current less 50 case head neck follicular den...,head and neck follicular dendrit cell sarcoma ...
1,effect of repeat intragastr balloon therapi in...,19yearold japanes male bmi 554 kgm 2 also li...,effect of repeat intragastr balloon therapi in...
2,golden retriev muscular dystrophi grmd deve...,studi canin model duchenn muscular dystrophi ...,golden retriev muscular dystrophi grmd deve...
3,polymorph in the pituitari growth hormon gene ...,investig promot polymorph pituitari growth hor...,polymorph in the pituitari growth hormon gene ...
4,qualit studi of highcost patient in an urban p...,examin patient account ill care among primari ...,qualit studi of highcost patient in an urban p...


Unnamed: 0,title,abstract,combine
0,human leukocyt antigeng hlag as a marker for...,human leukocyt antigeng hlag nonclass hlacla...,human leukocyt antigeng hlag as a marker for...
1,dichotom regul of gvhd through bidirect functi...,b lymphocyt attenu btla coinhibitori recepto...,dichotom regul of gvhd through bidirect functi...
2,multimod imag of chronic tophac gout,diagnosi gout usual base clinic present labora...,multimod imag of chronic tophac gout diagnos...
3,intraop choroid detach dure 23gaug vitrectomi,review intraop choroid detach 23gaug vitrectom...,intraop choroid detach dure 23gaug vitrectomi ...
4,ciita is not associ with risk of develop rheum...,major histocompat complex mhc class ii trans...,ciita is not associ with risk of develop rheum...


#### Vectorize X_train X_test

In [26]:
#vectorise X
tvec = TfidfVectorizer()
X_train_tvec = tvec.fit_transform(X_train['combine'])
X_test_tvec = tvec.transform(X_test['combine'])
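
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row, following the documented `TfidfVectorizer` defaults. The function `tfidf` is hypothetical and uses plain whitespace tokenization, which is a simplification of the library's tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["gout imag gout", "gout vitrectomi"])
print(vocab)  # ['gout', 'imag', 'vitrectomi']
```

In the second document, `vitrectomi` ends up with a higher weight than `gout`, because `gout` occurs in every document and therefore has the lowest IDF.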

In [27]:
X_train_tvec.shape
X_test_tvec.shape

(180364, 581123)

(60460, 581123)

In [28]:
X_train.shape
X_test.shape

(180364, 3)

(60460, 3)

#### Model fitting

In [29]:
# Train the Naive Bayes classifier using the Binary Relevance method
clf = OneVsRestClassifier(BernoulliNB())

In [30]:
clf.fit(X_train_tvec, y_train)

y_train_pred = clf.predict(X_train_tvec)
y_test_pred = clf.predict(X_test_tvec)

#### Train Data performance

In [31]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.49      0.70      0.57     19557
                  cardiovascular, haem and lymphatic diseases       0.68      0.59      0.63     30781
congenital_hereditary_and neonatal diseases and abnormalities       0.22      0.03      0.05      5702
                                    digestive system diseases       0.72      0.23      0.35     12407
                                                 eye diseases       0.08      0.01      0.02      3064
                                       immune system diseases       0.75      0.42      0.54     16609
                                                   infections       0.84      0.79      0.82     37765
                                     musculoskeletal diseases       0.76      0.29      0.42     13010
                                                    neoplasms       0.90

#### Test Data performance

In [32]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.47      0.63      0.54      6532
                  cardiovascular, haem and lymphatic diseases       0.63      0.46      0.53     10260
congenital_hereditary_and neonatal diseases and abnormalities       0.17      0.01      0.01      1900
                                    digestive system diseases       0.73      0.09      0.15      4136
                                                 eye diseases       0.00      0.00      0.00      1021
                                       immune system diseases       0.76      0.26      0.39      5536
                                                   infections       0.83      0.74      0.78     12588
                                     musculoskeletal diseases       0.78      0.16      0.26      4336
                                                    neoplasms       0.89

#### **Comments**:<br>

The goal for our model is to achieve samples avg precision and f1-score > 0.70.<br>


|Model|Description|Train samples avg precision|Train samples avg f1-score|Test samples avg precision|Test samples avg f1-score|Remarks|
|----|----|----|----|----|----|----|
|Baseline Model|<li>Labels=23</li><li>Features: Abstract</li><li>BernoulliNB</li><li>TFIDF</li>|0.81|0.63|0.77|0.56|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li>|
|Model 1|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>BernoulliNB</li><li>TFIDF</li>|0.81|0.64|0.78|0.57|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>Slight improvement over the baseline model</li>|

# Part 2: Models and Hyperparameter Tuning (GridSearch) <a class="anchor" id="ID_2"></a>

In [33]:
X_train = X_train.loc[:,'combine']
X_test = X_test.loc[:,'combine']

### BernoulliNB

In [36]:
%%time
# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(BernoulliNB()))
])

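# Define the hyperparameters to search over
# Note: an integer max_df is an absolute document count, not a proportion
# (1.0 would mean "no upper limit"), so max_df=1 is invalid alongside min_df > 1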
parameters = {
    'tfidf__min_df': [1,5,10],
    'tfidf__max_df': [0.5,1],
    'tfidf__max_features': [100000, 581245],
    'clf__estimator__alpha': [0.1,1.0]
}
# Define scoring metric
scoring = make_scorer(f1_score, average='samples')

# Perform grid search cross-validation
grid_search_1 = GridSearchCV(pipeline, parameters, cv=3, verbose=5, scoring=scoring)
grid_search_1.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.608 total time=  24.2s
[CV 2/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.609 total time=  23.5s
[CV 3/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.607 total time=  23.2s
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.601 total time=  23.2s
[CV 2/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.602 total time=  22.8s
[CV 3/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.600 total time=  22.7s
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=10;, score=0.594 total

24 fits failed out of a total of 72.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
24 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\joblib\memory.py", line 349, in __call__
    retu

CPU times: total: 22min 41s
Wall time: 22min 55s
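
The traceback above is truncated, so the exact error is not visible, but the failure count is consistent with one explanation: scikit-learn's vectorizers read an integer `min_df`/`max_df` as an absolute document count and a float as a proportion, and reject settings where the resolved `max_df` is below `min_df`. The sketch below only counts which grid combinations would be invalid under that rule. `doc_count` and `failed_fits` are hypothetical helpers, the grid keys are shortened, and `n_docs` is an assumed per-fold training size.

```python
from itertools import product
from numbers import Integral


def doc_count(threshold, n_docs):
    """Resolve min_df/max_df: ints are absolute counts, floats are proportions."""
    return threshold if isinstance(threshold, Integral) else threshold * n_docs


def failed_fits(grid, n_docs, cv):
    """Count fits whose resolved max_df falls below the resolved min_df."""
    keys = list(grid)
    failed = 0
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        if doc_count(params["max_df"], n_docs) < doc_count(params["min_df"], n_docs):
            failed += cv  # every CV fold of this candidate fails
    return failed


bernoulli_grid = {
    "min_df": [1, 5, 10],
    "max_df": [0.5, 1],
    "max_features": [100000, 581245],
    "alpha": [0.1, 1.0],
}
multinomial_grid = {**bernoulli_grid, "min_df": [1, 5]}

n_docs = 120_000  # assumption: about two thirds of the 180364 training rows per fold
print(failed_fits(bernoulli_grid, n_docs, cv=3))    # 24
print(failed_fits(multinomial_grid, n_docs, cv=3))  # 12
```

Both counts match the logs (24 of 72 here, 12 of 48 for the MultinomialNB search). If the intent was "no upper limit", `1.0` rather than `1` would express it, although that would add candidates to the search that were not evaluated in the runs shown.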


In [37]:
print("Best parameters: ", grid_search_1.best_params_)
print("Best score: ", grid_search_1.best_score_)

Best parameters:  {'clf__estimator__alpha': 0.1, 'tfidf__max_df': 0.5, 'tfidf__max_features': 581245, 'tfidf__min_df': 1}
Best score:  0.6262559915123315


In [38]:
y_train_pred = grid_search_1.predict(X_train)
y_test_pred = grid_search_1.predict(X_test)

#### Train Data performance

In [39]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.41      0.92      0.57     19557
                  cardiovascular, haem and lymphatic diseases       0.66      0.89      0.76     30781
congenital_hereditary_and neonatal diseases and abnormalities       0.39      0.87      0.53      5702
                                    digestive system diseases       0.65      0.92      0.76     12407
                                                 eye diseases       0.57      0.87      0.69      3064
                                       immune system diseases       0.54      0.91      0.68     16609
                                                   infections       0.83      0.91      0.87     37765
                                     musculoskeletal diseases       0.72      0.91      0.80     13010
                                                    neoplasms       0.87

#### Test Data performance

In [40]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.38      0.86      0.52      6532
                  cardiovascular, haem and lymphatic diseases       0.54      0.72      0.61     10260
congenital_hereditary_and neonatal diseases and abnormalities       0.30      0.55      0.39      1900
                                    digestive system diseases       0.57      0.72      0.64      4136
                                                 eye diseases       0.57      0.61      0.59      1021
                                       immune system diseases       0.45      0.76      0.57      5536
                                                   infections       0.77      0.83      0.80     12588
                                     musculoskeletal diseases       0.61      0.68      0.64      4336
                                                    neoplasms       0.79

#### **Comments**:<br>

The goal for our model is to achieve samples avg precision and f1-score > 0.70.<br>


|Model|Description|Train samples avg precision|Train samples avg f1-score|Test samples avg precision|Test samples avg f1-score|Remarks|
|----|----|----|----|----|----|----|
|Baseline Model|<li>Labels=23</li><li>Features: Abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.63|0.77|0.56|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li>|
|Model 1|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.64|0.78|0.57|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>Slight improvement over the baseline model</li>|
|Model 2|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.71|0.75|0.62|0.62|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>Test f1-score improved over the baseline model (0.62 vs 0.56), but test precision dropped (0.62 vs 0.77)</li>|

### MultinomialNB

In [41]:
%%time
# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(MultinomialNB()))
])

# Define the hyperparameters to search over
# Note: the integer max_df=1 is an absolute document count, so it is invalid with min_df=5
parameters = {
    'tfidf__min_df': [1,5],
    'tfidf__max_df': [0.5,1],
    'tfidf__max_features': [100000,581245],
    'clf__estimator__alpha': [0.1,1.0]
}
# Define scoring metric
scoring = make_scorer(f1_score, average='samples')

# Perform grid search cross-validation
grid_search_2 = GridSearchCV(pipeline, parameters, cv=3, verbose=5, scoring=scoring)
grid_search_2.fit(X_train, y_train)

print("Best parameters: ", grid_search_2.best_params_)
print("Best score: ", grid_search_2.best_score_)

y_train_pred = grid_search_2.predict(X_train)
y_test_pred = grid_search_2.predict(X_test)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.663 total time=  21.4s
[CV 2/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.669 total time=  21.3s
[CV 3/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=1;, score=0.666 total time=  21.2s
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.671 total time=  21.0s
[CV 2/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.677 total time=  21.2s
[CV 3/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.673 total time=  21.0s
[CV 1/3] END clf__estimator__alpha=0.1, tfidf__max_df=0.5, tfidf__max_features=581245, tfidf__min_df=1;, score=0.599 total 

12 fits failed out of a total of 48.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\geok1\.conda\envs\dsi-sg\lib\site-packages\joblib\memory.py", line 349, in __call__
    retu

Best parameters:  {'clf__estimator__alpha': 0.1, 'tfidf__max_df': 0.5, 'tfidf__max_features': 100000, 'tfidf__min_df': 5}
Best score:  0.673479534249489
CPU times: total: 15min 31s
Wall time: 15min 40s


#### Train Data performance

In [42]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.55      0.69      0.61     19557
                  cardiovascular, haem and lymphatic diseases       0.75      0.65      0.69     30781
congenital_hereditary_and neonatal diseases and abnormalities       0.74      0.43      0.55      5702
                                    digestive system diseases       0.84      0.60      0.70     12407
                                                 eye diseases       0.69      0.64      0.66      3064
                                       immune system diseases       0.81      0.68      0.74     16609
                                                   infections       0.88      0.81      0.84     37765
                                     musculoskeletal diseases       0.86      0.59      0.70     13010
                                                    neoplasms       0.91

#### Test Data performance

In [43]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.51      0.64      0.56      6532
                  cardiovascular, haem and lymphatic diseases       0.71      0.60      0.65     10260
congenital_hereditary_and neonatal diseases and abnormalities       0.65      0.31      0.42      1900
                                    digestive system diseases       0.81      0.51      0.62      4136
                                                 eye diseases       0.67      0.51      0.58      1021
                                       immune system diseases       0.78      0.61      0.68      5536
                                                   infections       0.86      0.79      0.82     12588
                                     musculoskeletal diseases       0.82      0.50      0.62      4336
                                                    neoplasms       0.90

#### **Comments**:<br>

The goal for our model is to achieve samples avg precision and f1-score > 0.70.<br>


|Model|Description|Train samples avg precision|Train samples avg f1-score|Test samples avg precision|Test samples avg f1-score|Remarks|
|----|----|----|----|----|----|----|
|Baseline Model|<li>Labels=23</li><li>Features: Abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.63|0.77|0.56|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li>|
|Model 1|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.64|0.78|0.57|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 2|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.71|0.75|0.62|0.62|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 3|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(MultinomialNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.82|0.72|0.78|0.68|<li>Did not meet the targeted model performance</li><li>There is improvement from previous models</li>|

### Logistic Regression 1

In [44]:
%%time
# Define the pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(max_iter=1000)))
])

# Define the hyperparameters to search over
parameters = {
    'tfidf__min_df': [5],
    'tfidf__max_df': [0.5],
    'tfidf__max_features': [100000,581245],
    'clf__estimator__C': [0.1, 1.0, 10.0]
}
# Define scoring metric
scoring = make_scorer(f1_score, average='samples')

# Perform grid search cross-validation
grid_search_3 = GridSearchCV(pipeline, parameters, cv=3, verbose=5, scoring=scoring)
grid_search_3.fit(X_train, y_train)

print("Best parameters: ", grid_search_3.best_params_)
print("Best score: ", grid_search_3.best_score_)

y_train_pred = grid_search_3.predict(X_train)
y_test_pred = grid_search_3.predict(X_test)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.619 total time=  43.3s
[CV 2/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.622 total time=  44.9s
[CV 3/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.618 total time=  43.6s
[CV 1/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=581245, tfidf__min_df=5;, score=0.619 total time=  43.6s
[CV 2/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=581245, tfidf__min_df=5;, score=0.622 total time=  44.9s
[CV 3/3] END clf__estimator__C=0.1, tfidf__max_df=0.5, tfidf__max_features=581245, tfidf__min_df=5;, score=0.618 total time=  43.6s
[CV 1/3] END clf__estimator__C=1.0, tfidf__max_df=0.5, tfidf__max_features=100000, tfidf__min_df=5;, score=0.722 total time= 1.3min
[CV 2/3] END clf

#### Train Data performance

In [45]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.87      0.76      0.81     19557
                  cardiovascular, haem and lymphatic diseases       0.92      0.78      0.85     30781
congenital_hereditary_and neonatal diseases and abnormalities       0.95      0.66      0.78      5702
                                    digestive system diseases       0.94      0.79      0.86     12407
                                                 eye diseases       0.96      0.82      0.88      3064
                                       immune system diseases       0.96      0.82      0.88     16609
                                                   infections       0.96      0.91      0.93     37765
                                     musculoskeletal diseases       0.95      0.80      0.87     13010
                                                    neoplasms       0.97

#### Test Data performance

In [46]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.71      0.59      0.64      6532
                  cardiovascular, haem and lymphatic diseases       0.82      0.67      0.73     10260
congenital_hereditary_and neonatal diseases and abnormalities       0.77      0.43      0.55      1900
                                    digestive system diseases       0.85      0.63      0.73      4136
                                                 eye diseases       0.84      0.58      0.68      1021
                                       immune system diseases       0.89      0.73      0.80      5536
                                                   infections       0.91      0.84      0.87     12588
                                     musculoskeletal diseases       0.86      0.65      0.74      4336
                                                    neoplasms       0.93

#### **Comments**:<br>

The goal for our model is to achieve samples avg precision and f1-score > 0.70.<br>


|Model|Description|Train samples avg precision|Train samples avg f1-score|Test samples avg precision|Test samples avg f1-score|Remarks|
|----|----|----|----|----|----|----|
|Baseline Model|<li>Labels=23</li><li>Features: Abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.63|0.77|0.56|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li>|
|Model 1|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.64|0.78|0.57|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 2|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.71|0.75|0.62|0.62|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 3|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(MultinomialNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.82|0.72|0.78|0.68|<li>Did not meet the targeted model performance</li><li>There is improvement from previous models</li>|
|Model 4|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(LogisticRegression)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.93|0.86|0.86|0.75|<li>There is improvement from previous models</li><li>Met the targeted metric performance BUT</li><li>Sign of overfitting</li>|
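
One way to see where the train/validation gap opens up is to ask `GridSearchCV` for training scores as well. The sketch below is illustrative only: it uses synthetic data from `make_multilabel_classification` in place of the TF-IDF features, and the names `X_toy`, `y_toy` and `cv_summary` are assumptions rather than objects from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data standing in for the real features (assumption).
X_toy, y_toy = make_multilabel_classification(
    n_samples=300, n_features=40, n_classes=5, random_state=0
)

# Same kind of scorer as in the grid search above: samples-averaged F1.
scoring = make_scorer(f1_score, average="samples", zero_division=0)

search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    {"estimator__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring=scoring,
    return_train_score=True,  # also record the score on the training folds
)
search.fit(X_toy, y_toy)

# One row per candidate: mean train score, mean validation score, and their gap.
cv_summary = pd.DataFrame(search.cv_results_)[
    ["param_estimator__C", "mean_train_score", "mean_test_score"]
]
cv_summary["gap"] = cv_summary["mean_train_score"] - cv_summary["mean_test_score"]
print(cv_summary)
```

A large `gap` for a candidate suggests it fits the training folds much better than the held-out folds, which is the pattern flagged for Model 4 in the table.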

### Logistic Regression 2

In [47]:
%%time
#vectorise X
tvec = TfidfVectorizer(max_df=0.5,min_df=5,max_features=10000)
X_train_tvec = tvec.fit_transform(X_train)
X_test_tvec = tvec.transform(X_test)

CPU times: total: 25.8 s
Wall time: 26 s


In [48]:
# Train a logistic regression classifier using the Binary Relevance (one-vs-rest) method
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000,C=10))

In [49]:
%%time
clf.fit(X_train_tvec, y_train)

y_train_pred = clf.predict(X_train_tvec)
y_test_pred = clf.predict(X_test_tvec)

CPU times: total: 2min 51s
Wall time: 2min 52s


In [50]:
print(classification_report(y_train, y_train_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.79      0.64      0.71     19557
                  cardiovascular, haem and lymphatic diseases       0.87      0.71      0.78     30781
congenital_hereditary_and neonatal diseases and abnormalities       0.85      0.47      0.61      5702
                                    digestive system diseases       0.90      0.71      0.79     12407
                                                 eye diseases       0.92      0.71      0.80      3064
                                       immune system diseases       0.93      0.76      0.84     16609
                                                   infections       0.94      0.86      0.89     37765
                                     musculoskeletal diseases       0.91      0.71      0.80     13010
                                                    neoplasms       0.95

In [51]:
print(classification_report(y_test, y_test_pred,zero_division=1,target_names=labels))
print("\n")

                                                               precision    recall  f1-score   support

                                              animal diseases       0.71      0.59      0.65      6532
                  cardiovascular, haem and lymphatic diseases       0.82      0.66      0.73     10260
congenital_hereditary_and neonatal diseases and abnormalities       0.67      0.37      0.48      1900
                                    digestive system diseases       0.84      0.62      0.72      4136
                                                 eye diseases       0.81      0.56      0.66      1021
                                       immune system diseases       0.88      0.73      0.80      5536
                                                   infections       0.90      0.83      0.86     12588
                                     musculoskeletal diseases       0.85      0.64      0.73      4336
                                                    neoplasms       0.93

#### **Comments**:<br>

The goal for our model is to achieve samples avg precision and f1-score > 0.70.<br>


|Model|Description|Train samples avg precision|Train samples avg f1-score|Test samples avg precision|Test samples avg f1-score|Remarks|
|----|----|----|----|----|----|----|
|Baseline Model|<li>Labels=23</li><li>Features: Abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.63|0.77|0.56|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li>|
|Model 1|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li>|0.81|0.64|0.78|0.57|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 2|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(BernoulliNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.71|0.75|0.62|0.62|<li>Did not meet the targeted model performance</li><li>Sign of overfitting</li><li>There is some improvement from baseline model</li>|
|Model 3|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(MultinomialNB)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.82|0.72|0.78|0.68|<li>Did not meet the targeted model performance</li><li>There is improvement from previous models</li>|
|Model 4|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(LogisticRegression)</li><li>TFIDF</li><li>GridSearchCV used</li>|0.93|0.86|0.86|0.75|<li>There is improvement from previous models</li><li>Met the targeted metric performance BUT</li><li>Sign of overfitting</li>|
|Model 5|<li>Labels=17</li><li>Features: Included both title and abstract</li><li>OneVsRestClassifier(LogisticRegression)</li><li>TFIDF</li>|0.89|0.80|0.85|0.75|<li>There is improvement from previous models</li><li>Met the targeted metric performance</li>|

# Part 3: Best Model Evaluation <a class="anchor" id="ID_3"></a>
 
The aim of our model is, given an article abstract, to assign it NLM Primary Disease term labels.
The targeted metric is a samples avg precision and f1-score > 0.70.

**Best Model performance**: Model 5<br>
- Model 5 meets our targeted metric performance on the test data
    - precision=0.85 and f1-score=0.75
    - It shows reduced overfitting (the difference between train and test scores is about 0.05 or less)
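
As a rough illustration of how these numbers can be obtained, the sketch below computes samples-averaged precision and F1 with scikit-learn on a tiny hand-made example, then derives the train/test gap from the Model 5 values reported in the table above. The helper `samples_avg_scores` and the toy arrays are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score


def samples_avg_scores(y_true, y_pred):
    """Return samples-averaged precision and F1 for multilabel indicator arrays."""
    return {
        "precision": precision_score(y_true, y_pred, average="samples", zero_division=1),
        "f1": f1_score(y_true, y_pred, average="samples", zero_division=1),
    }


# Toy indicator matrices: 3 articles, 3 labels (invented for illustration).
y_true_toy = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])
scores = samples_avg_scores(y_true_toy, y_pred_toy)
print(scores)

# Train/test gap for Model 5, using the values reported in the comparison table.
reported = {
    "train": {"precision": 0.89, "f1": 0.80},
    "test": {"precision": 0.85, "f1": 0.75},
}
gaps = {m: round(reported["train"][m] - reported["test"][m], 2) for m in ("precision", "f1")}
print(gaps)
```

With `average="samples"`, the metric is computed per article and then averaged over articles, so every article counts equally regardless of how many labels it carries.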

# Part 4: Discussion & Conclusion <a class="anchor" id="ID_4"></a>
## 4.1 Summary:
The image below shows an overview of the whole project:<br>
![img3](img/Project_Introduction_and_Overview_Summary.jpg)
## 4.2 Limitations:
|Area of limitation|Limitation Description|Actions needed|
|---|---|---|
|Matching Articles to NLM Disease groups|Some NLM Disease groups are poorly represented in our dataset (e.g. due to the limitations of our dictionaries of disease-associated terms)|Our dictionaries should be improved by including more terms associated with these diseases|
|Some labels might be too broad|E.g. Cardiovascular diseases & Hemic and Lymphatic diseases. The associated terms in the dictionaries overlap too much.|Our dictionaries should be improved by looking deeper for distinct terms associated with the individual disease types|
|Model's application|Relative to the scale and complexity of the existing problem, our model is limited to classifying *disease*-related articles into NLM Primary Disease groups.|A much deeper understanding of the complexity and varying relations of the MeSH terms is required to create a machine learning solution that labels any given article with the desired NLM standardised group of MeSH terms.|
## 4.3 Future work:
The ideal would be an automated algorithm that can label any article with NCBI's standardised collection of labels, given only the article's Title and Abstract.

To achieve this, we will need:
1. **More articles properly labelled** with the PubMed collection of standardised MeSH for model training
1. **More understanding of the processes** that NCBI follows in handling the vast variety of articles in the database
1. **More understanding of the complexity of MeSH** in the PubMed collection of standardised MeSH, including how new MeSH are added as the field of medicine continues to advance and diversify.
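
To make the goal above concrete, here is a minimal sketch of what labelling a new article from its title and abstract could look like with the same building blocks used in this notebook (TF-IDF plus one-vs-rest logistic regression). The tiny training corpus, the `label_article` helper, and the choice to join title and abstract with a space are all assumptions made for illustration; a real system would be trained on the full labelled dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented toy corpus (title + abstract already joined) with invented labels.
train_texts = [
    "Bacterial infection outcomes in hospitalised patients",
    "Tumour growth and metastasis in lung cancer",
    "Viral infection as a risk factor for tumour development",
    "Chemotherapy response in breast cancer tumours",
    "Antibiotic treatment of bacterial and viral infection",
]
train_labels = [
    ["infections"],
    ["neoplasms"],
    ["infections", "neoplasms"],
    ["neoplasms"],
    ["infections"],
]

# Convert label lists to a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(train_labels)

# TF-IDF features feeding one binary logistic regression per label.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y_toy)


def label_article(title, abstract):
    """Return the predicted label names for one article (hypothetical helper)."""
    text = f"{title} {abstract}"
    predicted = model.predict([text])
    return list(mlb.inverse_transform(predicted)[0])


print(label_article("Tumour markers", "A study of tumour cells in cancer patients."))
```

The returned list may be empty when no label's classifier predicts positive, so a production version would need a fallback such as picking the highest-probability label.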