In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Contents

## Problem Statement

What does your gut say?

Colorectal cancer is one of the leading causes of cancer death in the US. [*source*](https://www.cancer.org/cancer/colon-rectal-cancer/detection-diagnosis-staging/detection.html) Many of these deaths can be prevented through early diagnosis, which relies on patient biopsies and on pathologists detecting disease in the sampled tissue. As slide scanning technology becomes faster and more reliable, a larger volume of data becomes available to train and validate models. Combined with clinical information, gene expression or microarray data, and multi-omics data, computational pathology can assist pathologists in decision-making and help train future pathologists (Cui & Chang, 2021). [*source*](https://www.nature.com/articles/s41374-020-00514-0)

In this project, we will train several models to classify tissue types in the colon, then choose the best model for classifying tissues that are commonly misdiagnosed or misclassified. In addition to image classification, which complements pathologists' decision-making, we will use NLP to classify clinical text and identify genetic mutations for more personalised treatment, since genome sequencing and gene expression data can be very expensive.

## Background

## Data Used

## Data Dictionary

## Colorectal cancer dataset

In [2]:
# read colorectal cancer data set 
# dataset obtained from https://www.kaggle.com/kmader/colorectal-histology-mnist/

colorectal = pd.read_csv('../data/hmnist_64_64_L.csv')


In [3]:
# check first 5 rows of dataset

colorectal.head()

,pixel0000,pixel0001,pixel0002,pixel0003,pixel0004,pixel0005,pixel0006,pixel0007,pixel0008,pixel0009,...,pixel4087,pixel4088,pixel4089,pixel4090,pixel4091,pixel4092,pixel4093,pixel4094,pixel4095,label
0,134,99,119,130,142,169,152,139,117,87,...,112,89,73,100,120,120,126,140,195,2
1,55,64,74,63,74,75,71,73,70,77,...,79,85,86,77,68,66,65,68,69,2
2,114,116,136,152,132,100,151,150,127,205,...,128,157,159,205,182,143,129,89,122,2
3,86,82,88,85,103,93,98,109,104,115,...,79,80,109,128,89,85,80,63,48,2
4,168,143,140,139,129,123,123,141,137,101,...,231,199,183,195,179,134,142,158,149,2
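
Each row holds 4096 pixel columns plus a label, which matches a 64 x 64 grayscale image (64 * 64 = 4096). The sketch below reshapes one row back into a 2-D array. It assumes the pixels are stored in row-major order, which the column names suggest but the notebook does not confirm. The helper `row_to_image` and the synthetic `demo` frame are illustrative stand-ins for `colorectal`, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Assumed from the file name hmnist_64_64_L.csv: 64 x 64 grayscale images.
IMAGE_SIDE = 64


def row_to_image(row: pd.Series, side: int = IMAGE_SIDE) -> np.ndarray:
    """Reshape one row of pixel columns into a (side, side) array.

    Assumes row-major pixel order (pixel0000 is the top-left corner).
    """
    pixels = row.drop(labels="label").to_numpy(dtype=np.uint8)
    return pixels.reshape(side, side)


# Small synthetic stand-in for `colorectal` so the sketch runs on its own.
rng = np.random.default_rng(0)
columns = [f"pixel{i:04d}" for i in range(IMAGE_SIDE * IMAGE_SIDE)] + ["label"]
demo = pd.DataFrame(
    np.column_stack([rng.integers(0, 256, size=(3, IMAGE_SIDE**2)), [1, 2, 3]]),
    columns=columns,
)

image = row_to_image(demo.iloc[0])
print(image.shape)
```

With the real data, `row_to_image(colorectal.iloc[0])` should give an array that can be displayed with `plt.imshow(image, cmap='gray')`.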


In [9]:
# dataset information check

colorectal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 4097 entries, pixel0000 to label
dtypes: int64(4097)
memory usage: 156.3 MB
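
The 156.3 MB footprint comes from storing 5000 x 4097 values as 8-byte `int64`. Since grayscale pixel values and the labels both fit in the range 0 to 255, casting to `uint8` should cut memory to roughly one-eighth. The following is a minimal sketch of that idea; `downcast_pixels` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def downcast_pixels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy cast to uint8, refusing if any value would not fit."""
    values = df.to_numpy()
    if values.min() < 0 or values.max() > 255:
        raise ValueError("values fall outside the uint8 range 0..255")
    return df.astype(np.uint8)


# Tiny stand-in for `colorectal`, using values from the first rows shown above.
demo = pd.DataFrame(
    {"pixel0000": [134, 55, 114], "pixel0001": [99, 64, 116], "label": [2, 2, 2]},
    dtype="int64",
)

small = downcast_pixels(demo)
before = demo.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

On the real frame this would be `colorectal = downcast_pixels(colorectal)`. The range check is there so that an unexpected value raises an error instead of silently wrapping around.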


In [10]:
# check dataset unique values

colorectal.nunique()

pixel0000    236
pixel0001    238
pixel0002    241
pixel0003    239
pixel0004    238
            ... 
pixel4092    239
pixel4093    237
pixel4094    238
pixel4095    237
label          8
Length: 4097, dtype: int64
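
The `label` column has 8 unique values, so this is an 8-class problem. Before modelling it is worth checking how many samples each class has. A minimal sketch follows, with a hypothetical helper `class_counts` and a small synthetic frame standing in for `colorectal`.

```python
import pandas as pd


def class_counts(df: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """Count samples per class, ordered by class label."""
    return df[label_col].value_counts().sort_index()


# Synthetic stand-in for `colorectal`; the real counts are not shown here.
demo = pd.DataFrame({"label": [1, 1, 2, 2, 3, 3, 3]})

counts = class_counts(demo)
print(counts)
```

Calling `class_counts(colorectal)` on the real data would show whether the classes are balanced, which affects the choice of evaluation metric later on.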

In [11]:
# check dataset missing values

colorectal.isnull().sum().sort_values(ascending=False)

label        0
pixel2047    0
pixel1373    0
pixel1372    0
pixel1371    0
            ..
pixel2725    0
pixel2724    0
pixel2723    0
pixel2722    0
pixel0000    0
Length: 4097, dtype: int64

There are no missing values, so no columns or rows need to be dropped.
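
With more than 4000 columns, the sorted per-column output is truncated, so a single boolean check can be easier to read. This is a small sketch of that alternative; `has_missing` and the two demo frames are illustrative names only.

```python
import pandas as pd


def has_missing(df: pd.DataFrame) -> bool:
    """Return True if any cell in the frame is NaN or None."""
    return bool(df.isnull().to_numpy().any())


# Two small stand-ins: one complete frame and one with a gap.
demo_complete = pd.DataFrame({"pixel0000": [134, 55], "label": [2, 2]})
demo_gappy = pd.DataFrame({"pixel0000": [134.0, None], "label": [2, 2]})

print(has_missing(demo_complete), has_missing(demo_gappy))
```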

In [12]:
# view dataset statistics

colorectal.describe()

,pixel0000,pixel0001,pixel0002,pixel0003,pixel0004,pixel0005,pixel0006,pixel0007,pixel0008,pixel0009,...,pixel4087,pixel4088,pixel4089,pixel4090,pixel4091,pixel4092,pixel4093,pixel4094,pixel4095,label
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,137.4124,137.262,137.5234,137.9394,137.2922,136.531,136.7124,137.2592,137.3942,136.9946,...,136.8558,136.799,136.9958,137.4878,136.8152,136.7738,136.8662,136.769,136.8478,4.5
std,74.241325,74.09328,74.141781,74.101279,74.298932,74.760707,74.649521,74.667226,74.918143,75.012463,...,74.420652,74.272306,74.308294,74.06259,74.085413,74.26376,74.310709,73.854691,73.835275,2.291517
min,13.0,11.0,10.0,10.0,9.0,11.0,12.0,11.0,10.0,10.0,...,10.0,12.0,9.0,12.0,11.0,11.0,11.0,10.0,12.0,1.0
25%,75.0,74.0,76.0,76.0,74.75,73.0,74.0,74.0,74.0,74.0,...,74.0,74.0,73.75,75.0,75.0,74.0,74.0,75.0,75.0,2.75
50%,121.0,122.0,121.0,121.0,122.0,120.0,120.0,120.0,119.0,119.0,...,121.0,122.0,121.0,122.0,121.0,121.0,121.0,121.0,121.0,4.5
75%,222.0,219.25,220.0,221.0,221.0,221.0,221.0,222.0,223.0,222.0,...,220.0,220.0,221.0,220.0,218.25,218.0,218.25,220.0,219.25,6.25
max,248.0,249.0,252.0,248.0,250.0,248.0,250.0,249.0,251.0,253.0,...,253.0,254.0,252.0,250.0,249.0,249.0,249.0,249.0,250.0,8.0
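
The statistics show pixel values within the 0 to 255 range and labels running from 1 to 8. A common preparation step for model training is to separate the features from the labels and scale the pixels to [0, 1]. The sketch below shows one way to do this. Shifting the labels to start at 0 is an assumption made here because many classifiers expect 0-based class indices; the notebook itself does not state this step, and `split_and_scale` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def split_and_scale(df: pd.DataFrame, label_col: str = "label"):
    """Split a frame into scaled pixel features X and 0-based labels y."""
    # Scale 8-bit pixel intensities from 0..255 to 0..1.
    X = df.drop(columns=label_col).to_numpy(dtype=np.float32) / 255.0
    # Assumption: shift labels 1..8 to 0..7 for libraries that need 0-based classes.
    y = df[label_col].to_numpy() - 1
    return X, y


# Small synthetic stand-in for `colorectal`.
demo = pd.DataFrame(
    {"pixel0000": [0, 255, 51], "pixel0001": [102, 204, 153], "label": [1, 8, 4]}
)

X, y = split_and_scale(demo)
print(X.shape, y.tolist())
```

If the original 1-based labels are needed for reporting, the shift can be undone by adding 1 to the predictions.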


In [15]:
# view datatypes

colorectal.dtypes

pixel0000    int64
pixel0001    int64
pixel0002    int64
pixel0003    int64
pixel0004    int64
             ...  
pixel4092    int64
pixel4093    int64
pixel4094    int64
pixel4095    int64
label        int64
Length: 4097, dtype: object