# Machine Learning on Gut Microbiota of Patients with Colorectal Cancer (2): Data Cleaning

Using predictive analysis to classify samples as healthy controls or CRC patients from gut microbiota data.


## Identify the problem

Colorectal cancer (CRC), also known as bowel cancer, colon cancer, or rectal cancer, is the development of cancer from the colon or rectum (parts of the large intestine). Signs and symptoms may include blood in the stool, a change in bowel movements, weight loss, and fatigue. The gut microbiota plays a crucial role in CRC progression. Here, we investigate whether the gut microbiota can be used to distinguish healthy controls from CRC patients.

## Expected outcome

The aim is to build a model that classifies a sample as a healthy control or a CRC patient, using two classes:
* CRC = patient (disease present)
* healthy = control (disease absent)

## Objective

Since the labels in the data are discrete, the prediction falls into one of two categories (i.e. CRC or healthy). In machine learning this is a classification problem.
        
> *Thus, the goal is to classify healthy control or CRC patients. To achieve this we have used machine learning classification methods to fit a function that can predict the discrete class of new input.*

## Identify data sources

The dataset contains two files:
* **metadata**: The 1st and 2nd columns in the dataset store the unique ID numbers of the samples and disease (CRC=Patients, healthy=Control), respectively. 
* **profile**: The gut microbial species level profile. 
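
Because the two files are linked only through the sample IDs, it can help to check early that the IDs agree. The following is a minimal sketch on toy data, assuming the layout described above (sample IDs as the metadata index and as the profile column names); the names metadata_toy, profile_toy and the IDs S1 to S4 are illustrative and not part of the real dataset.

```python
import pandas as pd

# Toy stand-ins for the two files (illustrative values, not real data).
metadata_toy = pd.DataFrame(
    {"disease": ["healthy", "CRC", "healthy"]},
    index=pd.Index(["S1", "S2", "S3"], name="SampleID"),
)
profile_toy = pd.DataFrame(
    {
        "TaxaID": ["s__A", "s__B"],
        "S1": [10, 0],
        "S2": [0, 5],
        "S4": [3, 1],
    }
)

# Every profile column except TaxaID is assumed to be a sample ID.
sample_cols = set(profile_toy.columns) - {"TaxaID"}
metadata_ids = set(metadata_toy.index)

shared = sorted(sample_cols & metadata_ids)
only_in_metadata = sorted(metadata_ids - sample_cols)
only_in_profile = sorted(sample_cols - metadata_ids)

print("shared:", shared)
print("only in metadata:", only_in_metadata)
print("only in profile:", only_in_profile)
```

Samples that appear in only one of the two files cannot be used for supervised learning, so listing them before any filtering makes later row counts easier to explain.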

## Loading libraries and setting options

In [1]:
import numpy as np        
import pandas as pd

## Importing Dataset

First, load the CSV and TSV files using the read_csv and read_table functions of pandas, respectively.

In [2]:
metadata = pd.read_csv("./dataset/metadata.csv", index_col=0)
profile = pd.read_table("./dataset/species.tsv", sep="\t")

## Inspecting the data
The first step is to visually inspect the datasets.

In [3]:
metadata.head()

SampleID,disease
SAMD00114718,healthy
SAMD00114719,healthy
SAMD00114720,healthy
SAMD00114721,healthy
SAMD00114722,CRC


In [4]:
profile.head()

Unnamed: 0,TaxaID,SAMD00114718,SAMD00114719,SAMD00114720,SAMD00114721,SAMD00114722,SAMD00114723,SAMD00114724,SAMD00114726,SAMD00114727,...,SAMD00165024,SAMD00165025,SAMD00165026,SAMD00165027,SAMD00165028,SAMD00165029,SAMD00165030,SAMD00165031,SAMD00165032,SAMD00165033
0,s__Bacteroides_plebeius,46509517,5334509,6868169,1029678,7520,105,3515,3047974,90986,...,0,0,3902772,0,0,0,6311760,0,2286204,0
1,s__Bacteroides_dorei,8249892,230275,4054008,2029259,2318235,0,10920493,1777,4043706,...,0,138343,3005330,0,0,5136959,18113,0,242316,0
2,s__Faecalibacterium_prausnitzii,3696318,2053756,3267707,661965,350665,393585,536323,648125,1246731,...,0,2868791,1755420,5699714,0,1948287,0,152752,2584817,0
3,s__Eubacterium_eligens,3265545,182914,0,114447,546829,0,0,10419,895911,...,0,1370340,0,0,0,400810,578608,0,2600731,0
4,s__Bacteroides_ovatus,2871853,289955,1097263,110111,564558,580,51366,26697,0,...,1491036,443860,56055,2543082,0,279863,19714,361361,0,0


## Choosing only healthy or CRC samples

Select the healthy and CRC samples for further analysis.
* filtering the metadata by disease

In [8]:
phen = metadata.loc[(metadata['disease'] == 'healthy') | (metadata['disease'] == 'CRC')]

phen

SampleID,disease
SAMD00114718,healthy
SAMD00114719,healthy
SAMD00114720,healthy
SAMD00114721,healthy
SAMD00114722,CRC
...,...
SAMD00165029,healthy
SAMD00165030,healthy
SAMD00165031,healthy
SAMD00165032,healthy
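
The same selection can also be written with isin, which scales better when more than two labels need to be kept. This is a small sketch on toy data; the third label "adenoma" is a hypothetical value used only to show that other labels are dropped, and it is not claimed to occur in the real metadata.

```python
import pandas as pd

# Toy metadata; "adenoma" is a made-up extra label for illustration.
metadata_toy = pd.DataFrame(
    {"disease": ["healthy", "CRC", "adenoma", "healthy"]},
    index=pd.Index(["S1", "S2", "S3", "S4"], name="SampleID"),
)

# Keep only the rows whose label is in the list of wanted classes.
keep = ["healthy", "CRC"]
phen_toy = metadata_toy.loc[metadata_toy["disease"].isin(keep)]

print(phen_toy)
```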


* filtering out species with low occurrence (keeping species detected in more than 20% of the samples)

In [9]:
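# Keep only the selected samples (columns) and index the rows by species
profile_trim = profile[phen.index]
profile_trim.index = profile.TaxaID

# Keep species that are non-zero in more than 20% of the samples
prof = profile_trim[profile_trim.apply(lambda x: np.count_nonzero(x)/len(x), axis=1) > 0.2]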
prof

TaxaID,SAMD00114718,SAMD00114719,SAMD00114720,SAMD00114721,SAMD00114722,SAMD00114723,SAMD00114724,SAMD00114726,SAMD00114727,SAMD00114728,...,SAMD00165024,SAMD00165025,SAMD00165026,SAMD00165027,SAMD00165028,SAMD00165029,SAMD00165030,SAMD00165031,SAMD00165032,SAMD00165033
s__Bacteroides_plebeius,46509517,5334509,6868169,1029678,7520,105,3515,3047974,90986,0,...,0,0,3902772,0,0,0,6311760,0,2286204,0
s__Bacteroides_dorei,8249892,230275,4054008,2029259,2318235,0,10920493,1777,4043706,1072863,...,0,138343,3005330,0,0,5136959,18113,0,242316,0
s__Faecalibacterium_prausnitzii,3696318,2053756,3267707,661965,350665,393585,536323,648125,1246731,2339359,...,0,2868791,1755420,5699714,0,1948287,0,152752,2584817,0
s__Eubacterium_eligens,3265545,182914,0,114447,546829,0,0,10419,895911,0,...,0,1370340,0,0,0,400810,578608,0,2600731,0
s__Bacteroides_ovatus,2871853,289955,1097263,110111,564558,580,51366,26697,0,249687,...,1491036,443860,56055,2543082,0,279863,19714,361361,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
s__Klebsiella_pneumoniae,0,0,0,0,0,0,0,1544397,0,0,...,0,0,19537,0,0,0,0,0,0,0
s__Bacteroides_coprocola,0,0,0,0,0,0,0,1029096,0,0,...,0,0,0,0,0,5218937,0,0,0,0
s__Ruminococcus_lactaris,0,0,0,0,0,0,0,0,0,2800532,...,113761,0,0,0,0,761927,0,0,630160,0
s__Turicimonas_muris,0,0,0,0,0,0,0,0,0,1530,...,0,0,357,9347,0,1368,0,0,653,0
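
The occurrence filter can also be expressed without a row-wise apply, by taking the mean of a boolean table. The sketch below uses a toy count table (counts_toy and the species names are illustrative) and should give the same selection rule as the cell above: a species is kept when the fraction of samples with a non-zero count is strictly greater than 0.2.

```python
import pandas as pd

# Toy count table: rows are species, columns are samples.
counts_toy = pd.DataFrame(
    {
        "S1": [5, 0, 0],
        "S2": [3, 0, 2],
        "S3": [0, 0, 1],
        "S4": [7, 1, 0],
        "S5": [1, 0, 0],
    },
    index=pd.Index(["s__A", "s__B", "s__C"], name="TaxaID"),
)

# Fraction of samples in which each species has a non-zero count.
prevalence = (counts_toy != 0).mean(axis=1)

# Strict threshold: a species present in exactly 20% of samples is dropped.
prof_toy = counts_toy.loc[prevalence > 0.2]

print(prevalence)
print(prof_toy)
```

Here s__B is detected in one of five samples, so it sits exactly on the threshold and is removed, while s__A and s__C are kept.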


The **info()** method provides a concise summary of the data: the type of data in each column, the number of non-null values in each column, and how much memory the data frame uses.

Calling **dtypes.value_counts()** returns the number of columns of each type in a DataFrame (the older **get_dtype_counts()** method is no longer available in recent pandas versions):

In [10]:
# Review data types with "info()".
phen.info()

<class 'pandas.core.frame.DataFrame'>
Index: 504 entries, SAMD00114718 to SAMD00165033
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   disease  504 non-null    object
dtypes: object(1)
memory usage: 7.9+ KB


In [11]:
# Review number of columns of each data type in a DataFrame:
phen.dtypes.value_counts()

object    1
dtype: int64

In [12]:
# Check for missing values
phen.isnull().any()

disease    False
dtype: bool

In [14]:
phen.disease.unique()

array(['healthy', 'CRC'], dtype=object)
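
Besides listing the unique labels, it is often useful to look at how many samples fall into each class, since a strong imbalance affects how a classifier should be evaluated. The following sketch uses toy labels (phen_toy is illustrative); the real counts are not shown in this notebook and are not assumed here.

```python
import pandas as pd

# Toy labels, not the real class distribution.
phen_toy = pd.DataFrame(
    {"disease": ["healthy", "healthy", "CRC", "healthy"]},
    index=pd.Index(["S1", "S2", "S3", "S4"], name="SampleID"),
)

# Absolute number of samples per class.
class_counts = phen_toy["disease"].value_counts()

# Same information as fractions of the total.
class_fractions = phen_toy["disease"].value_counts(normalize=True)

print(class_counts)
print(class_fractions)
```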

From the results above, disease is a categorical variable, because it takes a fixed number of possible values (i.e., healthy or CRC). Most machine learning algorithms expect numbers, not strings, as their inputs, so we need some method of encoding to convert the labels.
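
One simple way to encode the two labels is an explicit mapping, sketched below on toy data. The choice of 0 for healthy and 1 for CRC is an assumption made for illustration; the notebook does not state which encoding is used downstream.

```python
import pandas as pd

# Assumed encoding for illustration: healthy -> 0, CRC -> 1.
label_map = {"healthy": 0, "CRC": 1}

phen_toy = pd.DataFrame(
    {"disease": ["healthy", "healthy", "CRC"]},
    index=pd.Index(["S1", "S2", "S3"], name="SampleID"),
)

# map() returns NaN for labels missing from the dictionary,
# so an unexpected label is caught by the check below.
y = phen_toy["disease"].map(label_map)
if y.isna().any():
    raise ValueError("Found a label that is not in label_map")

print(y)
```

An explicit dictionary keeps the meaning of each number visible, which makes later model outputs easier to interpret than an automatically assigned code.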



## Integrating the phen and prof data

Here, we select **disease** from phen and merge it with prof into a new dataset for downstream analysis.

In [17]:
phen_cln = phen.iloc[:, 0].rename_axis("SampleID").reset_index()
phen_cln.head()

Unnamed: 0,SampleID,disease
0,SAMD00114718,healthy
1,SAMD00114719,healthy
2,SAMD00114720,healthy
3,SAMD00114721,healthy
4,SAMD00114722,CRC


In [18]:

prof_cln = prof.T.rename_axis("SampleID").reset_index()
prof_cln.head()


TaxaID,SampleID,s__Bacteroides_plebeius,s__Bacteroides_dorei,s__Faecalibacterium_prausnitzii,s__Eubacterium_eligens,s__Bacteroides_ovatus,s__Parabacteroides_distasonis,s__Ruminococcus_gnavus,s__Phascolarctobacterium_faecium,s__Bacteroides_uniformis,...,s__Bacteroides_finegoldii,s__Haemophilus_sp_HMSC71H05,s__Clostridium_saccharolyticum,s__Streptococcus_anginosus_group,s__Streptococcus_sp_A12,s__Klebsiella_pneumoniae,s__Bacteroides_coprocola,s__Ruminococcus_lactaris,s__Turicimonas_muris,s__Proteobacteria_bacterium_CAG_139
0,SAMD00114718,46509517,8249892,3696318,3265545,2871853,2327330,1920299,1506928,1371476,...,0,0,0,0,0,0,0,0,0,0
1,SAMD00114719,5334509,230275,2053756,182914,289955,89183,35688,0,729206,...,0,0,0,0,0,0,0,0,0,0
2,SAMD00114720,6868169,4054008,3267707,0,1097263,990122,1490407,0,1272701,...,0,0,0,0,0,0,0,0,0,0
3,SAMD00114721,1029678,2029259,661965,114447,110111,2705778,59274,0,940124,...,0,0,0,0,0,0,0,0,0,0
4,SAMD00114722,7520,2318235,350665,546829,564558,2529966,4608830,0,1888066,...,512018,137432,71548,15826,0,0,0,0,0,0


In [19]:
mdat = pd.merge(phen_cln, prof_cln, on="SampleID", how="inner")
mdat.head()

Unnamed: 0,SampleID,disease,s__Bacteroides_plebeius,s__Bacteroides_dorei,s__Faecalibacterium_prausnitzii,s__Eubacterium_eligens,s__Bacteroides_ovatus,s__Parabacteroides_distasonis,s__Ruminococcus_gnavus,s__Phascolarctobacterium_faecium,...,s__Bacteroides_finegoldii,s__Haemophilus_sp_HMSC71H05,s__Clostridium_saccharolyticum,s__Streptococcus_anginosus_group,s__Streptococcus_sp_A12,s__Klebsiella_pneumoniae,s__Bacteroides_coprocola,s__Ruminococcus_lactaris,s__Turicimonas_muris,s__Proteobacteria_bacterium_CAG_139
0,SAMD00114718,healthy,46509517,8249892,3696318,3265545,2871853,2327330,1920299,1506928,...,0,0,0,0,0,0,0,0,0,0
1,SAMD00114719,healthy,5334509,230275,2053756,182914,289955,89183,35688,0,...,0,0,0,0,0,0,0,0,0,0
2,SAMD00114720,healthy,6868169,4054008,3267707,0,1097263,990122,1490407,0,...,0,0,0,0,0,0,0,0,0,0
3,SAMD00114721,healthy,1029678,2029259,661965,114447,110111,2705778,59274,0,...,0,0,0,0,0,0,0,0,0,0
4,SAMD00114722,CRC,7520,2318235,350665,546829,564558,2529966,4608830,0,...,512018,137432,71548,15826,0,0,0,0,0,0
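
An inner merge silently drops samples that are missing from either table. A possible safeguard, sketched here on toy data, is to run an outer merge with indicator=True to list unmatched samples and to pass validate="one_to_one" so that duplicated sample IDs raise an error. The toy tables and IDs are illustrative only.

```python
import pandas as pd

phen_toy = pd.DataFrame(
    {"SampleID": ["S1", "S2", "S3"], "disease": ["healthy", "CRC", "healthy"]}
)
prof_toy = pd.DataFrame(
    {"SampleID": ["S1", "S2", "S4"], "s__A": [10, 0, 3], "s__B": [0, 5, 1]}
)

# Outer merge with an indicator column to see which samples lack a partner.
checked = pd.merge(
    phen_toy, prof_toy, on="SampleID", how="outer",
    validate="one_to_one", indicator=True,
)
unmatched = checked.loc[checked["_merge"] != "both", ["SampleID", "_merge"]]
print(unmatched)

# The inner merge keeps only samples present in both tables.
mdat_toy = pd.merge(
    phen_toy, prof_toy, on="SampleID", how="inner", validate="one_to_one"
)
print(mdat_toy)
```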


In [21]:
# Save the cleaned version of the dataframe for future analysis
mdat.to_csv('./dataset/MergeData.tsv',
            sep='\t', encoding='utf-8', index=False)
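
To confirm that the saved file can be read back as expected, a quick round trip can be done. The sketch below writes a toy table to a temporary directory rather than to ./dataset, so it does not touch the real output; mdat_toy and its columns are illustrative.

```python
import os
import tempfile

import pandas as pd

mdat_toy = pd.DataFrame(
    {
        "SampleID": ["S1", "S2"],
        "disease": ["healthy", "CRC"],
        "s__A": [10, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "MergeData.tsv")
    # Same writer options as in the cell above.
    mdat_toy.to_csv(path, sep="\t", encoding="utf-8", index=False)
    # Read the file back with the matching separator.
    reloaded = pd.read_csv(path, sep="\t")

print(reloaded.shape == mdat_toy.shape)
print(list(reloaded.columns) == list(mdat_toy.columns))
```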

## Summary

* 151 species with an occurrence above 0.2 across the selected samples were retained
* 504 samples (healthy controls and CRC patients) were chosen

## Reference

* [Breast-cancer-risk-prediction](https://github.com/Jean-njoroge/Breast-cancer-risk-prediction)