Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body


Tobacco use is the cause of about 22% of cancer deaths. Another 10% are due to obesity, poor diet, lack of physical activity or excessive drinking of alcohol. Other factors include certain infections, exposure to ionizing radiation and environmental pollutants. In the developing world, 15% of cancers are due to infections such as Helicobacter pylori, hepatitis B, hepatitis C, human papillomavirus infection, Epstein–Barr virus and human immunodeficiency virus (HIV). These factors act, at least partly, by changing the genes of a cell. Typically, many genetic changes are required before cancer develops. Approximately 5–10% of cancers are due to inherited genetic defects from a person's parents. Cancer can be detected by certain signs and symptoms or screening tests. It is then typically further investigated by medical imaging and confirmed by biopsy.

Although there are over 50 identifiable hereditary forms of cancer, less than 0.3% of the population are carriers of a cancer-related genetic mutation and these make up less than 3–10% of all cancer cases. The vast majority of cancers are non-hereditary ("sporadic cancers"). Hereditary cancers are primarily caused by an inherited genetic defect. A cancer syndrome or family cancer syndrome is a genetic disorder in which inherited genetic mutations in one or more genes predisposes the affected individuals to the development of cancers and may also cause the early onset of these cancers. Although cancer syndromes exhibit an increased risk of cancer, the risk varies. For some of these diseases, cancer is not the primary feature and is a rare consequence.

Many of these syndromes are caused by mutations in tumor suppressor genes that regulate cell growth. Other common mutations alter the function of DNA repair genes, oncogenes and genes involved in the production of blood vessels. Certain inherited mutations in the genes BRCA1 and BRCA2 with a more than 75% risk of breast cancer and ovarian cancer. Some of the inherited genetic disorders that can cause colorectal cancer include familial adenomatous polyposis and hereditary non-polyposis colon cancer; however, these represent less than 5% of colon cancer cases. In many cases, genetic testing can be used to identify mutated genes or chromosomes that are passed through generations.

# 1) Preliminary work

## 1.1) Installations

In [2]:
! pip install biopython



In [3]:
! pip install python-Levenshtein



In [0]:
from Bio import Entrez
from Bio.Seq import Seq
from Bio import SeqIO

import re
import collections

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import cm

from Levenshtein import distance

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


## 1.2) Loading data

#### 1.2.1) Non cancer genes

In [5]:
l=[]


handle = Entrez.esearch(db="nucleotide", term='"Homo sapiens"[ORGN] NOT genome NOT cancer NOT tumor', retmax=1000)
record = Entrez.read(handle, validate = False)
idlist=record['IdList']

handle2 = Entrez.efetch(db="nucleotide", id=idlist, rettype="fasta", retmode="text")

for seq_record in SeqIO.parse(handle2,'fasta'):
  l.append([seq_record.id, str(seq_record.seq), seq_record.description.split(',')[0]])
      

a=pd.DataFrame(l)
a['label']=0
a.head()

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


Unnamed: 0,0,1,2,label
0,MH536743.1,GTCTCATCTGCCTCCACTCGGCCTCAGTTCCTCATCACTGTTCCTG...,MH536743.1 Homo sapiens HLA-DPA1*02:01:01:NEW ...,0
1,MH536742.1,CCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCTT...,MH536742.1 Homo sapiens HLA-B*35:01:01:NEW gene,0
2,MH536741.1,CAGAAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTGGC...,MH536741.1 Homo sapiens HLA-A*23:NEW pseudogene,0
3,MH536740.1,CCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCTT...,MH536740.1 Homo sapiens HLA-B*18:01:01:NEW gene,0
4,MH536739.1,CCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGGGCCGC...,MH536739.1 Homo sapiens HLA-B*49:01:01:NEW gene,0


#### 1.2.2.) Cancer genes

In [6]:
l=[]

handle = Entrez.esearch(db="nucleotide", term='"Homo sapiens"[Organism] NOT Roswell[All Fields]) AND cancer[Title]', retmax=1000)
record = Entrez.read(handle, validate = False)
idlist=record['IdList']

handle2 = Entrez.efetch(db="nucleotide", id=idlist, rettype="fasta", retmode="text")

for seq_record in SeqIO.parse(handle2,'fasta'):
  l.append([seq_record.id, str(seq_record.seq), seq_record.description.split(',')[0]])
      

b=pd.DataFrame(l)
b['label']=1
b.head()

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


Unnamed: 0,0,1,2,label
0,NR_109833.1,AAATCTCAGCCTCCCACTCCCATATTTACAGTTTGATTAGGGAGGC...,NR_109833.1 Homo sapiens prostate cancer assoc...,1
1,NR_109834.1,CCGAGGTGATCAGGTGGACTTTCCTGGATGTTCTGGGTCTTGACCT...,NR_109834.1 Homo sapiens colon cancer associat...,1
2,NR_015379.3,TGACATTCTTCTGGACAATGAGTCCCATCATCTCTCCACCATGCAC...,NR_015379.3 Homo sapiens urothelial cancer ass...,1
3,NR_026941.1,AGCGGGCTGCAGGGCTGCGGGCGCTTGGTTCGGCCTGGCCCGGCCG...,NR_026941.1 Homo sapiens cancer susceptibility...,1
4,NR_026940.1,AGCGGGCTGCAGGGCTGCGGGCGCTTGGTTCGGCCTGGCCCGGCCG...,NR_026940.1 Homo sapiens cancer susceptibility...,1


In [0]:
data=pd.concat([a,b], axis=0)

## 1.3) Vectorize genes sequences 

In [0]:
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, analyzer='char')

x = vectorizer.fit_transform(list(data[1]))

# 2) Random forest classifier 

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.  Random  forests correct for decision trees' habit of overfitting to their training set.

## 2.1) Creating classifier 

#### 2.2) Splitting train and test sets 

In [0]:
X_train, X_test, y_train, y_test = train_test_split( x, list(data['label']) )

#### 2.3) Training classifier 

In [11]:
mlp = RandomForestClassifier()
mlp.fit(X_train, y_train )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## 2.2) Evaluating accuracy and performance of classifier  

Prediction on test set

In [0]:
predictions = mlp.predict(X_test)

classification report

In [13]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.86      0.92      0.89       240
          1       0.92      0.86      0.89       260

avg / total       0.89      0.89      0.89       500



accuracy

In [14]:
accuracy_score(y_test, predictions)

0.888

# 3) Looking for cancer genes in genome 

In [0]:
def ORF_finder(sequenses):
        """
        This function finds ORF in dna sequenses
        :param sequenses: set of DNA sequenses in dictionary format (name:sequense)
        :return: lists of open reading frames (1,2,3) and length of longest ORF
        """
        ORF=[]
        for item in sequenses:
            x=re.findall(r'ATG(?:(?!TAA|TAG|TGA)...)*(?:TAA|TAG|TGA)',item)
            ORF.extend(x)

        return ORF

## 3.1) Getting a genome 

In [16]:
handle = Entrez.esearch(db="nucleotide", term='"Homo sapiens"[ORGN] AND complete genome[title]', retmax=1)
record = Entrez.read(handle, validate = False)

idlist=record['IdList']
handle2 = Entrez.efetch(db="nucleotide", id=idlist, rettype="fasta", retmode="text")

genome=''

for seq_record in SeqIO.parse(handle2,'fasta'):
  genome= seq_record.seq

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


## 3.2) Finding ORF

In [0]:
n={'A':'T','C':'G','T':'A','G':'C'}

rev_g=''
for i in genome:
  rev_g=n[i]+rev_g

genes=ORF_finder([str(genome), rev_g])

## 3.3) Vectorising genes

In [0]:
a=pd.DataFrame(genes)
x = vectorizer.fit_transform(a[0].apply(lambda x: x+'N'))                    

## 3.4) Making predictions

In [23]:
mlp.predict(x) 

array([0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0])