<img src="../images/topcover.jpg" width="1000" height="50">

##### Earlier notebooks showed that classifying images as Stroma or Tumour has the potential to improve diagnosis and to indicate disease severity through the Tumour:Stroma ratio. After diagnosis, cancer treatment can be made more personalised when more is known about the type of mutation driving colorectal cancer in the patient. Clinical text data on colorectal cancer can therefore help to identify the likely mutation type. If a treatment exists for that mutation, it can be administered; otherwise, research can be directed at finding novel treatments for the mutation types associated with colorectal cancer.

##### In addition, for research purposes, the tumour:stroma ratio or the morphology and texture of the cells could be examined for correlation with the mutation type.

In [1]:
# imports relevant modules

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
# plot_confusion_matrix is unused here and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

# Import CountVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#train_colorectal = pd.read_csv('../data/train_colorectal.csv')
train_nlp = pd.read_csv('../data/train_colorectal.csv')

In [3]:
train_nlp

Unnamed: 0,ID,Text,Gene,Variation,Class
0,28,sequencing studies have identified many recurr...,TERT,C228T,7
1,31,sequencing studies have identified many recurr...,TERT,Promoter Mutations,7
2,33,the current world health organization classifi...,TERT,Amplification,2
3,34,sequencing studies have identified many recurr...,TERT,C250T,7
4,35,abstract dicer plays a critical role in micr...,DICER1,G1809R,4
...,...,...,...,...,...
916,3256,neuroblastoma the most common paediatric solid...,CASP8,Promoter Hypermethylation,4
917,3262,ret is a singlepass transmembrane receptor tyr...,RET,S891A,7
918,3269,oncogenic fusion of the ret rearranged during ...,RET,Fusions,2
919,3278,ret is a singlepass transmembrane receptor tyr...,RET,A883F,7


In [4]:
# set up data for modelling

X = train_nlp['Text']
y = train_nlp['Class']

In [5]:
# Check distribution since this is a classification problem

y.value_counts(normalize = True)

7    0.378936
1    0.187839
4    0.156352
2    0.143322
5    0.049946
6    0.048860
3    0.019544
9    0.008686
8    0.006515
Name: Class, dtype: float64
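The class proportions above also give the accuracy a trivial "always predict the largest class" model would reach, which is a useful floor for the models that follow. The snippet below is a minimal sketch that recomputes this baseline from the proportions printed above; the variable names (`class_share`, `baseline_accuracy`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class proportions copied from the value_counts(normalize=True) output above.
class_share = pd.Series(
    {7: 0.378936, 1: 0.187839, 4: 0.156352, 2: 0.143322, 5: 0.049946,
     6: 0.048860, 3: 0.019544, 9: 0.008686, 8: 0.006515}
)

# A model that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier worth keeping should score clearly above this value on the test set.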

In [6]:
# split the data into the training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42
                                                    )

In [7]:

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]



In [8]:
# instantiate CountVectorizer on unigrams with a lemmatizing tokenizer
# (no stop_words argument is passed, so stop words are kept)

cvec = CountVectorizer(analyzer='word', tokenizer=LemmaTokenizer(), ngram_range=(1, 1))
# optional: max_features=1000 could be added to cap the vocabulary size
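To illustrate how a callable tokenizer and a stop-word list interact inside `CountVectorizer`, here is a small sketch on a toy corpus. `SimpleTokenizer` is a hypothetical stand-in for `LemmaTokenizer` (it uses a regular expression instead of NLTK so that it runs without downloading any NLTK data), and the toy documents and stop-word list are invented for the example.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


class SimpleTokenizer:
    """Hypothetical stand-in for LemmaTokenizer: alphabetic tokens only."""

    def __call__(self, doc):
        return re.findall(r"[a-z]+", doc.lower())


# Toy corpus, invented for illustration only.
toy_docs = ["The tumour has a mutation", "A mutation of the gene"]

# An explicit stop-word list keeps the example deterministic.
toy_cvec = CountVectorizer(tokenizer=SimpleTokenizer(),
                           stop_words=["the", "a", "of", "has"])
toy_counts = toy_cvec.fit_transform(toy_docs)

# vocabulary_ maps each kept token to its column index.
vocab = sorted(toy_cvec.vocabulary_)
print(vocab)
print(toy_counts.toarray())
```

scikit-learn may warn that `token_pattern` is ignored when a tokenizer is supplied; that warning is harmless here.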

In [9]:
# fit the vectorizer on our corpus.
cvec.fit(X_train)



CountVectorizer(tokenizer=<__main__.LemmaTokenizer object at 0x0000008A075DD550>)

In [11]:
# transform the corpus.
X_train = cvec.transform(X_train)

In [12]:
X_train

<736x82232 sparse matrix of type '<class 'numpy.int64'>'
	with 1387628 stored elements in Compressed Sparse Row format>

In [13]:
# observe x shape

X_train.shape

(736, 82232)

In [14]:

# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
cvec.get_feature_names()[10:25]

['+a',
 '+ap',
 '+association',
 '+at',
 '+bach',
 '+bp',
 '+chx',
 '+d',
 '+dd',
 '+delptpqp',
 '+distal',
 '+dmso',
 '+dox',
 '+edel',
 '+egf']

In [15]:
# transform test
X_test = cvec.transform(X_test)

In [16]:
X_test.shape

(185, 82232)
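The cells above fit the vectorizer on the training split and only transform the test split. The cross-validated searches later in the notebook, however, run on the already vectorized `X_train`, so each validation fold has contributed to the vocabulary. One possible way to keep vectorization inside each fold is a `Pipeline`. The sketch below shows the idea on a tiny invented corpus; `toy_texts`, `toy_labels` and `pipe` are assumptions made for the example, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus with two classes, for illustration only.
toy_texts = [
    "tert promoter mutation activates telomerase",
    "tert amplification in tumour samples",
    "tert promoter variant c228t",
    "telomerase activation by tert mutation",
    "ret fusion drives kinase signalling",
    "ret receptor tyrosine kinase mutation",
    "oncogenic ret fusion in carcinoma",
    "ret kinase activation by fusion",
]
toy_labels = [7, 7, 7, 7, 2, 2, 2, 2]

# The vectorizer is refit on the training part of every fold,
# so held-out documents never influence the vocabulary.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(scores)
```

The same pipeline object could be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `nb__alpha`).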

## Naive Bayes Model

In [17]:
# choose multinomial naive bayes

# instantiate our model

nb = MultinomialNB()

In [18]:
# fit our model

model = nb.fit(X_train, y_train)

In [19]:
# generate our predictions

predictions = model.predict(X_test)

In [20]:
# accuracy score of our model on the training set.

model.score(X_train, y_train)

0.7771739130434783

In [21]:
# accuracy score of our model on the testing set.

model.score(X_test, y_test)

0.5945945945945946

In [22]:
predictions 

array([7, 7, 5, 4, 4, 4, 2, 4, 1, 6, 1, 7, 6, 7, 1, 7, 2, 7, 7, 7, 1, 7,
       7, 7, 7, 7, 2, 7, 4, 7, 4, 5, 7, 4, 7, 4, 5, 7, 7, 4, 7, 1, 7, 2,
       7, 1, 4, 7, 9, 7, 4, 2, 1, 4, 5, 7, 7, 1, 7, 5, 7, 4, 2, 1, 5, 7,
       2, 2, 7, 7, 2, 2, 7, 7, 7, 7, 4, 1, 7, 7, 1, 7, 9, 7, 7, 1, 7, 1,
       7, 4, 7, 4, 3, 1, 2, 4, 1, 7, 7, 7, 5, 1, 7, 7, 7, 1, 2, 7, 7, 7,
       7, 7, 7, 4, 4, 1, 7, 7, 7, 2, 1, 2, 1, 1, 1, 2, 2, 1, 7, 2, 1, 1,
       6, 5, 3, 7, 7, 7, 1, 7, 6, 7, 7, 7, 4, 1, 7, 4, 1, 2, 1, 7, 4, 1,
       1, 7, 7, 5, 7, 7, 1, 1, 7, 1, 2, 2, 4, 7, 7, 1, 3, 7, 1, 2, 7, 1,
       7, 1, 7, 7, 1, 3, 7, 7, 2], dtype=int64)

In [36]:
f1_score1 = f1_score(y_test, predictions, average='weighted')
f1_score1

0.5795299621757659
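With `average='weighted'`, the F1 score is computed per class and then averaged using each class's number of true samples (its support) as the weight. The following sketch checks this on a small set of made-up labels; the arrays are invented for the example and do not come from the notebook's data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels, for illustration only.
y_true = np.array([7, 7, 7, 1, 1, 4])
y_pred = np.array([7, 7, 1, 1, 4, 4])
labels = [1, 4, 7]

# Per-class F1, in the order given by `labels`.
per_class = f1_score(y_true, y_pred, average=None, labels=labels)

# Support = number of true samples of each class.
support = np.array([(y_true == c).sum() for c in labels])

# Weighted F1 = support-weighted mean of the per-class scores.
manual = float((per_class * support).sum() / support.sum())
weighted = f1_score(y_true, y_pred, average="weighted")

print(per_class, manual, weighted)
```

Because the weights follow class frequency, the weighted F1 is dominated by the large classes; `average='macro'` would treat every class equally.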

## KNN Model

In [37]:
k_range = list(range(1,31))
weight_options = ["uniform", "distance"]

param_grid = dict(n_neighbors = k_range, weights = weight_options)

In [38]:
#KNN using GridSearch to find optimum KNN value

knn = KNeighborsClassifier() 
opt_knn = GridSearchCV(knn, param_grid, cv=5)
opt_knn.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                         13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                         23, 24, 25, 26, 27, 28, 29, 30],
                         'weights': ['uniform', 'distance']})

In [39]:
# check knn best parameter

opt_knn.best_params_

{'n_neighbors': 6, 'weights': 'distance'}

In [40]:
# generate predictions
predictions1 = opt_knn.predict(X_test)

In [41]:
opt_knn.score(X_train, y_train)

0.9279891304347826

In [42]:
opt_knn.score(X_test, y_test)   

0.5243243243243243

In [67]:
f1_score2 = f1_score(y_test, predictions1,average='weighted')
f1_score2 

0.5080688207003996

## SVM Model

In [44]:
# Instantiate support vector machine.
svc = SVC()

In [45]:
gs1 = GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf', 'poly'), 'degree':[2]})
gs1.fit(X_train,y_train);

In [46]:
predictions2 = gs1.predict(X_test)

In [47]:
gs1.score(X_train, y_train)

0.7635869565217391

In [48]:
gs1.score(X_test, y_test)

0.6162162162162163

In [68]:
f1_score3 = f1_score(y_test, predictions2,average='weighted')
f1_score3

0.5712320911561898

## Random Forests Model




In [50]:
rf = RandomForestClassifier(n_estimators=100)

In [51]:
pre_score = cross_val_score(estimator = rf,
                            X = X_train, 
                            y = y_train,
                            scoring = 'accuracy',
                            cv = 10,
                            verbose = 0)

print('Random Forest mean score: %5.4f' %np.mean(pre_score))




Random Forest mean score: 0.6047


In [52]:
# gridsearch for random forests

rf_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
gs = GridSearchCV(rf, param_grid=rf_params, cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.6073175216032359


{'max_depth': None, 'n_estimators': 150}

In [53]:
# Random Forests using GridSearchCV

gs.score(X_train, y_train)

0.9293478260869565

In [54]:
# Random Forests using GridSearchCV

gs.score(X_test, y_test)

0.5837837837837838

In [55]:
predictions3 = gs.predict(X_test)

In [69]:
f1_score4 = f1_score(y_test, predictions3,average='weighted')
f1_score4

0.5549556680055967

## Logistic Regression Model



In [57]:
parameters = {'C': [0.001, 0.01, 0.1, 1, 10],
              'class_weight': [None, 'balanced'],
              'penalty': ['l1', 'l2']}

In [58]:
lr = LogisticRegression(solver = 'liblinear', 
                        max_iter = 1000,
                        random_state = 42)

gs_results = GridSearchCV(estimator = lr,                                    # Specify the model we want to GridSearch.
                          param_grid = parameters,                           # Specify the grid of parameters we want to search.
                          scoring = 'accuracy',                              # Specify accuracy as the metric to optimize.
                          cv = 5).fit(X_train, y_train) 

In [59]:
gs_results.best_estimator_.get_params()

{'C': 0.001,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [60]:
gs_results.best_score_

0.5842158485015628
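Since `GridSearchCV` refits the best parameter combination on the full training data by default (`refit=True`), the tuned model is available as `best_estimator_` and does not have to be re-created by hand. The sketch below demonstrates this on synthetic data. It is a simplified assumption-laden example: the data comes from `make_classification`, only `C` is searched, and the default solver is used rather than the notebook's `liblinear` setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, for illustration only.
X_toy, y_toy = make_classification(n_samples=120, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=42)

toy_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000, random_state=42),
    param_grid={"C": [0.01, 1, 10]},
    scoring="accuracy",
    cv=5,
)
toy_search.fit(X_toy, y_toy)

# best_estimator_ is already fitted with the winning parameters.
best_model = toy_search.best_estimator_
print(toy_search.best_params_, toy_search.best_score_)
print(best_model.score(X_toy, y_toy))
```

Using `best_estimator_` (or calling `predict` and `score` on the search object itself) guarantees that the evaluated model matches the parameters the search selected.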

In [61]:
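# Note: C=0.01 and penalty='l1' below differ from the grid-search best estimator above (C=0.001, penalty='l2').
logit = LogisticRegression(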
 C= 0.01,
 class_weight= None,
 dual= False,
 fit_intercept= True,
 intercept_scaling= 1,
 l1_ratio= None,
 max_iter= 1000,
 multi_class= 'auto',
 n_jobs= None,
 penalty= 'l1',
 random_state= 42,
 solver= 'liblinear',
 tol= 0.0001,
 verbose= 0,
 warm_start= False)

In [62]:

logit.fit(X = X_train,
          y = y_train)

LogisticRegression(C=0.01, max_iter=1000, penalty='l1', random_state=42,
                   solver='liblinear')

In [63]:
logit.score(X_train, y_train)

0.7581521739130435

In [64]:
logit.score(X_test, y_test)

0.5405405405405406

In [65]:
predictions4 = logit.predict(X_test)

In [70]:
f1_score5 = f1_score(y_test, predictions4,average='weighted')
f1_score5

0.5153100562347647

**Summary table for Classification of Clinical Text data into 9 classes of mutation types:**

| Model | Test accuracy | Weighted F1 score | Baseline (proportion of largest class) |
|:---------:|:---:|:--------:|:--------:|
| Naive Bayes | 0.595 | 0.580 | 0.379 |
| KNN | 0.524 | 0.508 | 0.379 |
| SVC | 0.616 | 0.571 | 0.379 |
| Random Forests | 0.584 | 0.555 | 0.379 |
| Logistic Regression | 0.541 | 0.515 | 0.379 |
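The same comparison can be assembled programmatically, which avoids copying rounded numbers by hand. This is a minimal sketch that hard-codes the scores printed in the cells above; in a live notebook the values could instead be taken from the `score` calls and the `f1_score1` to `f1_score5` variables. The `summary` frame and its column names are illustrative.

```python
import pandas as pd

# Scores copied from the outputs of the cells above.
summary = pd.DataFrame(
    {
        "Model": ["Naive Bayes", "KNN", "SVC", "Random Forests",
                  "Logistic Regression"],
        "Test accuracy": [0.5945945945945946, 0.5243243243243243,
                          0.6162162162162163, 0.5837837837837838,
                          0.5405405405405406],
        "Weighted F1": [0.5795299621757659, 0.5080688207003996,
                        0.5712320911561898, 0.5549556680055967,
                        0.5153100562347647],
    }
).set_index("Model")

# Proportion of the largest class (class 7) from the distribution check.
summary["Baseline"] = 0.378936
summary["Gain over baseline"] = summary["Test accuracy"] - summary["Baseline"]

best_by_accuracy = summary["Test accuracy"].idxmax()
best_by_f1 = summary["Weighted F1"].idxmax()

print(summary.round(3))
print(best_by_accuracy, best_by_f1)
```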


##### The SVM has the best test accuracy (0.616), while Naive Bayes has the highest weighted F1 score (0.580). All five models beat the majority-class baseline of 0.379, but the scores are generally low, which could be attributed to the unbalanced classes and the small amount of data. With more data and an improved model, classifying text data into mutation types could prove useful: it could help to identify patients eligible for treatments that already exist for a particular mutation type, while the remaining mutation types become targets for research into novel therapies.
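One way to see how class imbalance affects a model is to look at per-class recall and the confusion matrix rather than overall accuracy alone. The sketch below uses made-up labels in which a rare class is never predicted; the arrays are invented for the example and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: class 7 is common, class 8 is rare.
y_true = np.array([7] * 6 + [1] * 3 + [8] * 1)
y_pred = np.array([7] * 6 + [1, 1, 7] + [7])
labels = [1, 7, 8]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class, in the order given by `labels`.
per_class_recall = recall_score(y_true, y_pred, labels=labels,
                                average=None, zero_division=0)

accuracy = float((y_true == y_pred).mean())

print(cm)
print(per_class_recall, accuracy)
```

Here the accuracy is 0.8 even though the rare class is missed entirely, which is why per-class metrics are worth checking on the real test set as well.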

## Education purposes

##### At their optimum, both models (the image classifier from the earlier notebooks and the text classifier here) can be used to train new pathologists and help them make better diagnoses, since it takes 10 years of training to develop an eye for detail and to distinguish between tissue types and cancers.

## R.E.A.D

##### All in all, both models can be used for Research, Education and Diagnostics purposes once they are improved and optimised. A website has been created using Flask for this multipurpose classification. It also includes a forum page for discussing research articles and journals related to colorectal cancer diagnosis and treatment.
