<img src="../images/topcover.jpg" width="1000" height="50">

##### Earlier notebooks showed that classifying images as stroma or tumour has the potential to improve diagnosis and to indicate disease severity through stroma-rich and stroma-poor groupings. After diagnosis, treatment can be made more personalised if more is known about the type of mutation driving the patient's colorectal cancer. Clinical text data on colorectal cancer can help to identify the likely mutation type. If a treatment is available for that mutation, it can be administered; otherwise, research can be directed towards novel treatments for the mutation types involved in colorectal cancer.



In [28]:
# imports relevant modules

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

# Import CountVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score

In [180]:
pip install imblearn




In [183]:
pip install delayed




In [2]:
from imblearn.over_sampling import SMOTE

In [3]:
#train_colorectal = pd.read_csv('../data/train_colorectal.csv')
train_nlp = pd.read_csv('../data/traincolorectal.csv')

In [4]:
train_nlp

Unnamed: 0,ID,Text,Gene,Variation,Class
0,28,sequencing studies have identified many recurr...,TERT,C228T,7
1,31,sequencing studies have identified many recurr...,TERT,Promoter Mutations,7
2,33,the current world health organization classifi...,TERT,Amplification,2
3,34,sequencing studies have identified many recurr...,TERT,C250T,7
4,35,abstract dicer plays a critical role in micr...,DICER1,G1809R,4
...,...,...,...,...,...
916,3256,neuroblastoma the most common paediatric solid...,CASP8,Promoter Hypermethylation,4
917,3262,ret is a singlepass transmembrane receptor tyr...,RET,S891A,7
918,3269,oncogenic fusion of the ret rearranged during ...,RET,Fusions,2
919,3278,ret is a singlepass transmembrane receptor tyr...,RET,A883F,7


In [5]:
# set up data for modelling

X = train_nlp['Text']
y = train_nlp['Class']

In [6]:
# Check distribution since this is a classification problem

y.value_counts(normalize = True)

7    0.378936
1    0.187839
4    0.156352
2    0.143322
5    0.049946
6    0.048860
3    0.019544
9    0.008686
8    0.006515
Name: Class, dtype: float64
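
The proportions above set the bar any classifier has to clear: always predicting the most frequent class (7) would already be right about 37.9% of the time. A minimal sketch of that calculation, with the proportions copied by hand from the output above (the `class_share` name is illustrative):

```python
# Class proportions copied from the value_counts(normalize=True) output above.
class_share = {
    7: 0.378936, 1: 0.187839, 4: 0.156352, 2: 0.143322, 5: 0.049946,
    6: 0.048860, 3: 0.019544, 9: 0.008686, 8: 0.006515,
}

# A classifier that always predicts the most frequent class scores exactly
# that class's share as its accuracy.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Imbalance ratio between the most and least frequent classes.
imbalance_ratio = baseline_accuracy / min(class_share.values())

print(majority_class, round(baseline_accuracy, 3), round(imbalance_ratio, 1))
```

The ratio between the largest and smallest class is what motivates the oversampling step later in the notebook.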

In [7]:
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [8]:
# instantiate CountVectorizer with unigrams and a lemmatizing tokenizer
# (no stop-word list is passed, so stop words are kept)

cvec = CountVectorizer(analyzer='word', tokenizer=LemmaTokenizer(), ngram_range=(1, 1))
# optional: add max_features=1000 to cap the vocabulary size

In [9]:
X = cvec.fit_transform(X)

In [10]:
# split the data into the training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=42
                                                    )
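
Note that the vectorizer in the cells above is fitted on the full corpus before the split, so the vocabulary also contains words that appear only in test documents. One common way to avoid this is to split the raw text first and wrap the vectorizer and the classifier in a `Pipeline`. The sketch below uses a tiny made-up corpus (`docs`, `labels`) and the default tokenizer instead of `LemmaTokenizer`; it illustrates the pattern and is not the workflow used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train_nlp['Text'] / train_nlp['Class'].
docs = [
    "tert promoter mutation increases telomerase activity",
    "tert promoter mutation found in tumour samples",
    "telomerase activity raised by tert promoter variant",
    "tert promoter variant increases expression",
    "ret fusion drives kinase signalling",
    "ret kinase fusion detected in carcinoma",
    "oncogenic ret fusion activates kinase",
    "ret fusion kinase signalling in carcinoma",
]
labels = [7, 7, 7, 7, 2, 2, 2, 2]

# Split the raw text first, so the test documents stay unseen.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vocabulary on the training documents only.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])
pipe.fit(docs_train, y_train)

vocab_size = len(pipe.named_steps["cvec"].vocabulary_)
test_predictions = pipe.predict(docs_test)
print(vocab_size, list(test_predictions))
```

A pipeline like this can also be passed straight to `GridSearchCV`, so the vocabulary is refitted inside every cross-validation fold.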

In [11]:
# observe x shape

X_train.shape

(552, 91254)

In [12]:

cvec.get_feature_names()[10:25]

['+a',
 '+ap',
 '+association',
 '+at',
 '+bach',
 '+bp',
 '+byl',
 '+cetuximab',
 '+chemo',
 '+chx',
 '+cobimetinib',
 '+d',
 '+dd',
 '+delptpqp',
 '+distal']

In [13]:
X_test.shape

(369, 91254)

In [14]:
# oversample the minority classes in the training set only, to tackle class imbalance

X_resample, y_resampled = SMOTE(k_neighbors=2).fit_resample(X_train, y_train)
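
SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its `k_neighbors` nearest minority-class neighbours, and interpolating somewhere on the line between the two. The rarest class makes up roughly 0.65% of the data, which leaves only a handful of training rows, and that is presumably why `k_neighbors` is kept as small as 2 here. The NumPy sketch below shows just the interpolation step on made-up 2-D points; `synthesize` is a hypothetical helper, not part of imblearn.

```python
import numpy as np

def synthesize(minority, k_neighbors, n_new, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation (illustrative)."""
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from point i to every minority point; skip position 0 of
        # the sorted order, which is the point itself.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k_neighbors + 1]
        j = rng.choice(neighbours)
        # Pick a random position on the segment between the point and its neighbour.
        lam = rng.random()
        new_points.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(new_points)

# Three made-up minority samples: the minimum that k_neighbors=2 can work with.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = synthesize(minority, k_neighbors=2, n_new=5)
print(synthetic.shape)
```

Every synthetic point lies between two real minority samples, so oversampling adds no new information; it only rebalances what the classifier sees during training.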

## Naive Bayes Model

In [15]:
# choose multinomial naive bayes

# instantiate our model

nb = MultinomialNB()

In [16]:
# fit our model

model = nb.fit(X_resample, y_resampled)

In [17]:
# generate our predictions

predictions = model.predict(X_test)

In [18]:
# accuracy score of our model on the (SMOTE-resampled) training set.

model.score(X_resample, y_resampled)

0.8787878787878788

In [19]:
# accuracy score of our model on the testing set.

model.score(X_test, y_test)

0.5934959349593496

In [20]:
predictions 

array([3, 5, 7, 2, 7, 7, 7, 1, 6, 1, 5, 7, 7, 7, 6, 2, 7, 1, 1, 2, 6, 9,
       7, 1, 2, 3, 3, 7, 5, 7, 7, 2, 1, 4, 3, 4, 1, 7, 2, 3, 7, 4, 7, 7,
       2, 7, 1, 1, 7, 7, 7, 4, 7, 6, 5, 7, 7, 7, 7, 6, 7, 2, 2, 5, 5, 7,
       7, 7, 2, 2, 7, 7, 4, 7, 5, 1, 7, 3, 1, 4, 7, 4, 5, 7, 7, 1, 1, 7,
       7, 1, 2, 7, 7, 4, 1, 7, 2, 6, 3, 2, 7, 2, 7, 1, 7, 2, 2, 2, 7, 7,
       5, 3, 1, 7, 4, 3, 1, 7, 7, 7, 2, 7, 1, 6, 4, 5, 1, 1, 7, 5, 7, 4,
       7, 1, 6, 1, 7, 1, 4, 7, 2, 4, 4, 4, 1, 2, 7, 1, 2, 3, 2, 1, 7, 7,
       1, 1, 7, 2, 1, 1, 5, 1, 1, 1, 2, 1, 7, 1, 7, 2, 7, 7, 1, 2, 2, 1,
       2, 1, 4, 7, 4, 2, 1, 7, 5, 7, 7, 7, 5, 2, 7, 6, 7, 9, 7, 1, 1, 7,
       2, 1, 1, 7, 7, 7, 7, 2, 1, 2, 7, 5, 7, 7, 7, 1, 2, 2, 1, 7, 5, 7,
       1, 6, 2, 5, 1, 7, 7, 7, 2, 5, 7, 2, 1, 7, 7, 1, 1, 6, 7, 7, 1, 1,
       7, 4, 1, 5, 7, 2, 4, 7, 1, 3, 1, 5, 1, 7, 7, 7, 1, 1, 3, 7, 3, 2,
       1, 7, 2, 7, 7, 2, 7, 3, 1, 2, 4, 7, 1, 3, 3, 1, 4, 6, 7, 1, 5, 5,
       4, 5, 3, 2, 7, 2, 1, 2, 7, 7, 9, 1, 7, 7, 4,

In [21]:
f1_score1 = f1_score(y_test, predictions, average='weighted')
f1_score1

0.5892954573084404

In [30]:
precision1 = precision_score(y_test, predictions, average='weighted')
precision1 

  _warn_prf(average, modifier, msg_start, len(result))


0.6170202102392158
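
With `average='weighted'`, precision is computed per class and then averaged using each class's share of the true labels as the weight. The truncated `_warn_prf` line above is most likely scikit-learn's warning that at least one class was never predicted, in which case its precision is undefined and counted as 0. A pure-Python sketch on made-up labels (the `weighted_precision` helper is hypothetical):

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Support-weighted mean of per-class precision; undefined precision counts as 0."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # A class that is never predicted has 0/0 precision; treat it as 0.
        precision = correct / predicted if predicted else 0.0
        total += precision * n_true
    return total / len(y_true)

# Made-up example: class 8 is present in y_true but never predicted.
y_true = [7, 7, 7, 1, 1, 8]
y_pred = [7, 7, 1, 1, 7, 7]
score = weighted_precision(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class frequency, a rare class that is never predicted barely moves the weighted score, which is worth keeping in mind when reading the numbers in this notebook.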

## KNN Model

In [31]:
k_range = list(range(1,31))
weight_options = ["uniform", "distance"]

param_grid = dict(n_neighbors = k_range, weights = weight_options)

In [32]:
# KNN: use GridSearchCV to find the best n_neighbors and weights

knn = KNeighborsClassifier() 
opt_knn = GridSearchCV(knn, param_grid, cv=2)
opt_knn.fit(X_train,y_train)

GridSearchCV(cv=2, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                         13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                         23, 24, 25, 26, 27, 28, 29, 30],
                         'weights': ['uniform', 'distance']})

In [33]:
# check knn best parameter

opt_knn.best_params_

{'n_neighbors': 12, 'weights': 'distance'}
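
The grid search selected `weights='distance'`, which means each of the 12 neighbours votes with a weight of 1/distance rather than with an equal vote, so closer documents count for more. A small sketch of that voting rule on made-up neighbours (`distance_weighted_vote` is a hypothetical helper, and neighbours at zero distance are ignored for brevity):

```python
from collections import defaultdict

def distance_weighted_vote(neighbour_labels, neighbour_distances):
    """Return the label with the largest sum of 1/distance votes."""
    votes = defaultdict(float)
    for label, dist in zip(neighbour_labels, neighbour_distances):
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)

# Made-up neighbourhood: class 7 has more neighbours, class 2 has closer ones.
labels = [7, 7, 7, 2, 2]
distances = [4.0, 5.0, 5.0, 1.0, 2.0]

uniform_winner = max(set(labels), key=labels.count)          # plain majority
weighted_winner = distance_weighted_vote(labels, distances)  # 1/distance votes
print(uniform_winner, weighted_winner)
```

With distance weighting every training point is its own nearest neighbour at distance zero, which helps explain why the training accuracy is so much higher than the test accuracy for this model.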

In [34]:
# generate predictions
predictions1 = opt_knn.predict(X_test)

In [35]:
opt_knn.score(X_train, y_train)

0.9293478260869565

In [36]:
opt_knn.score(X_test, y_test)   

0.5149051490514905

In [37]:
f1_score2 = f1_score(y_test, predictions1,average='weighted')
f1_score2 

0.49983779625370117

In [78]:
precision2 = precision_score(y_test, predictions1, average='weighted',zero_division='warn')
precision2 

  _warn_prf(average, modifier, msg_start, len(result))


0.5092730810032968

## SVM Model

In [40]:
# Instantiate support vector machine.
svc = SVC()

In [41]:
gs1 = GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf', 'poly'), 'degree':[2]})
gs1.fit(X_train,y_train);
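
`GridSearchCV` tries every combination in `param_grid`, and since no `cv` is passed here it falls back to the default 5-fold cross-validation (scikit-learn 0.22 and later), unlike the other searches in this notebook, which use `cv=2`. Note that `degree` only affects the `poly` kernel. The sketch below just enumerates the grid to count how many models get fitted; it assumes the default `cv=5`.

```python
from itertools import product

# Same grid as the cell above.
param_grid = {"C": [1, 10], "kernel": ("linear", "rbf", "poly"), "degree": [2]}
n_folds = 5  # assumption: scikit-learn's default when cv is not specified

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold, plus one final refit on all training data.
n_fits = len(candidates) * n_folds + 1
print(len(candidates), n_fits)
```

Printing `gs1.best_params_` after fitting would show which of these candidates was selected.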



In [42]:
predictions2 = gs1.predict(X_test)

In [43]:
gs1.score(X_train, y_train)

0.7590579710144928

In [44]:
gs1.score(X_test, y_test)

0.6016260162601627

In [45]:
f1_score3 = f1_score(y_test, predictions2,average='weighted')
f1_score3

0.5629585219156645

In [79]:
precision3 = precision_score(y_test, predictions2, average='weighted', zero_division='warn')
precision3 

  _warn_prf(average, modifier, msg_start, len(result))


0.5750412048871787

## Random Forests Model




In [47]:
rf = RandomForestClassifier(n_estimators=100)

In [80]:
pre_score = cross_val_score(estimator = rf,
                            X = X_train, 
                            y = y_train,
                            scoring = 'accuracy',
                            cv = 2,
                            verbose = 0)

print('Random Forest mean score: %5.4f' %np.mean(pre_score))


Random Forest mean score: 0.5453


In [81]:
# gridsearch for random forests

rf_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
gs = GridSearchCV(rf, param_grid=rf_params, cv=2)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.5416666666666667


{'max_depth': None, 'n_estimators': 150}

In [82]:
# training accuracy of the grid-searched random forest

gs.score(X_train, y_train)

0.9293478260869565

In [83]:
# test accuracy of the grid-searched random forest

gs.score(X_test, y_test)

0.5799457994579946

In [84]:
predictions3 = gs.predict(X_test)

In [85]:
f1_score4 = f1_score(y_test, predictions3,average='weighted')
f1_score4

0.549451199881517

In [86]:
precision4 = precision_score(y_test, predictions3, average='weighted', zero_division='warn')
precision4

  _warn_prf(average, modifier, msg_start, len(result))


0.5658797770479412

## Logistic Regression Model



In [87]:
parameters = {'C': [0.001, 0.01, 0.1, 1, 10],
              'class_weight': [None, 'balanced'],
              'penalty': ['l1', 'l2']}

In [88]:
lr = LogisticRegression(solver = 'liblinear', 
                        max_iter = 1000,
                        random_state = 42)

gs_results = GridSearchCV(estimator = lr,            # Specify the model we want to GridSearch.
                          param_grid = parameters,   # Specify the grid of parameters we want to search.
                          scoring = 'accuracy',      # Specify accuracy as the metric to optimize.
                          cv = 2).fit(X_train, y_train) 

In [89]:
gs_results.best_estimator_.get_params()

{'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [90]:
gs_results.best_score_

0.5670289855072463

In [91]:
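# NOTE: the grid search above selected penalty='l2'; this cell fits penalty='l1',
# so the scores below describe the L1 model rather than the search's best estimator.
logit = LogisticRegression(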
 C= 0.01,
 class_weight= None,
 dual= False,
 fit_intercept= True,
 intercept_scaling= 1,
 l1_ratio= None,
 max_iter= 1000,
 multi_class= 'auto',
 n_jobs= None,
 penalty= 'l1',
 random_state= 42,
 solver= 'liblinear',
 tol= 0.0001,
 verbose= 0,
 warm_start= False)

In [92]:

logit.fit(X = X_train,
          y = y_train)

LogisticRegression(C=0.01, max_iter=1000, penalty='l1', random_state=42,
                   solver='liblinear')
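
Retyping the selected hyperparameters by hand makes it easy for the refitted model to drift from what the search chose: the search output above reports `penalty='l2'`, while the model fitted here uses `penalty='l1'`. One way to avoid that is to use the already refitted `best_estimator_` directly. A minimal sketch on synthetic binary data from `make_classification` (the data and the reduced grid are placeholders, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the vectorized text data.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", max_iter=1000, random_state=42),
    param_grid={"C": [0.01, 0.1, 1], "penalty": ["l1", "l2"]},
    scoring="accuracy",
    cv=2,
).fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already fitted on the
# full training set with the selected hyperparameters.
best_model = search.best_estimator_
selected = {name: best_model.get_params()[name] for name in search.best_params_}
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

In this notebook the equivalent would be `gs_results.best_estimator_`, which can be scored on `X_test` without constructing a second `LogisticRegression`.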

In [93]:
logit.score(X_train, y_train)

0.7518115942028986

In [94]:
logit.score(X_test, y_test)

0.5718157181571816

In [95]:
predictions4 = logit.predict(X_test)

In [96]:
f1_score5 = f1_score(y_test, predictions4,average='weighted')
f1_score5

0.5508437337406842

In [97]:
precision5 = precision_score(y_test, predictions4, average='weighted', zero_division='warn')
precision5 

  _warn_prf(average, modifier, msg_start, len(result))


0.5437328867088181

**Summary table for classification of clinical text data into 9 mutation classes:**

| Model | Test accuracy | Weighted precision | Baseline (share of largest class) |
|:---------:|:---:|:--------:|:--------:|
| Naive Bayes | 0.593 | 0.617 | 0.379 |
| KNN | 0.515 | 0.509 | 0.379 |
| SVC | 0.602 | 0.575 | 0.379 |
| Random Forests | 0.580 | 0.566 | 0.379 |
| Logistic Regression | 0.572 | 0.544 | 0.379 |


##### Naive Bayes has the best weighted precision (0.617), while SVC has the highest test accuracy (0.602). All models beat the 0.379 baseline, but the scores are low overall, which could be attributed to the imbalanced classes and the small dataset (921 documents). The class imbalance was tackled with SMOTE oversampling, which was applied only to the Naive Bayes training data. With more data and further tuning, such a model could prove highly useful for classifying clinical text into mutation types.
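
To read the scores against the baseline, it can help to express each model's test accuracy as a lift over always predicting class 7. The sketch below uses the test accuracies printed by the score cells earlier in this notebook, rounded to three decimals:

```python
# Test accuracies as printed by the score cells in this notebook (rounded).
test_accuracy = {
    "Naive Bayes": 0.593,
    "KNN": 0.515,
    "SVC": 0.602,
    "Random Forests": 0.580,
    "Logistic Regression": 0.572,
}
baseline = 0.379  # share of the largest class (class 7)

# Rank models by how far they beat the majority-class baseline.
lift = {name: round(acc - baseline, 3) for name, acc in test_accuracy.items()}
ranking = sorted(lift, key=lift.get, reverse=True)
for name in ranking:
    print(f"{name:<20} +{lift[name]:.3f}")
```

Since neither SMOTE nor the random forest is given a `random_state`, these figures can shift slightly between runs.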

## Executive summary

##### Machine learning makes it possible to classify colorectal cancer tissue types. This supports pathologists facing an increasing workload by enhancing diagnostic capabilities, and it also provides an avenue for training new pathologists. Beyond diagnosis, the clinical text classification platform can help pathologists personalise treatment according to mutation type and can steer research towards novel therapies for colorectal cancer patients. Taken together, the website and machine learning tools can improve patients' prognosis and outcomes.

## R.E.A.D website:

#### A website has been created to classify clinical evidence text into 9 mutation types. A more sophisticated platform that lets pathologists interact with and annotate uploaded images for tumour/stroma classification is on the way! A forum page has also been added for discussing research articles on colorectal cancer diagnosis and treatment, alongside a blog page.

<img src="../images/topcover.jpg" width="1000" height="50">