<a id='sec0'></a>
# Text Analysis 1
- Importing Data
- <a href='#sec1'>Exemplary Text Analysis for Row 3</a>
- <a href='#sec2'>Write function to get gene-ish words list and mutation type table</a>
- <a href='#sec3'>Compiling the entire text-ome - testing</a>
- <a href='#sec4'>Compiling the entire text-ome - full mutation table</a>
- <a href='#sec5'>Compiling the entire text-ome - full gene-like words table</a>
- <a href='#sec6'>Compiling the entire gene-ome - full gene table (not genome)</a>
- <a href='#sec7'>Convert Mutation_Types in Class file</a>
- <a href='#sec8'>Combined All!</a>
- <a href='#sec9'>Test with Random Forest</a>
- <a href='#sec10'>Test with Simple SVM</a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sns.set_context("paper")
%matplotlib inline

<b>Importing train_text</b>

In [2]:
class_train = pd.read_csv('train_variants')
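# Raw string so the regex escapes reach the parser unchanged; the file uses "||" between ID and text
text_train = pd.read_csv("train_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])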
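
The `||` separator handling can be checked on a tiny in-memory sample before reading the full file. This is a minimal sketch: the two sample rows and the `ID,Text` header line are made up for illustration and only mimic the layout assumed by the call above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in an "ID||Text" layout (not taken from the real file)
sample = (
    "ID,Text\n"
    "0||First article text, with a comma.\n"
    "1||Second article | with a single pipe.\n"
)

demo = pd.read_csv(
    io.StringIO(sample),
    sep=r"\|\|",       # regex for a literal double pipe
    engine="python",   # multi-character regex separators need the Python parser
    header=None,
    skiprows=1,        # skip the header line, which uses a comma instead of "||"
    names=["ID", "Text"],
)
print(demo)
```

Commas and single pipes inside the text stay in the `Text` column because only the double pipe is treated as a separator.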

In [3]:
class_train.head()

Unnamed: 0,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2
3,3,CBL,N454D,3
4,4,CBL,L399V,4


In [4]:
text_train.head()

Unnamed: 0,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...
3,3,Recent evidence has demonstrated that acquired...
4,4,Oncogenic mutations in the monomeric Casitas B...


<a id='sec1'></a>
# Exemplary Text Analysis for Row 3 (<a href='#sec0'>Back To Top</a>)

In [5]:
txt1 = text_train.iloc[3, 1]

In [6]:
class_train.iloc[3, :]

ID               3
Gene           CBL
Variation    N454D
Class            3
Name: 3, dtype: object

In [7]:
txt1

'Recent evidence has demonstrated that acquired uniparental disomy (aUPD) is a novel mechanism by which pathogenetic mutations in cancer may be reduced to homozygosity. To help identify novel mutations in myeloproliferative neoplasms (MPNs), we performed a genome-wide single nucleotide polymorphism (SNP) screen to identify aUPD in 58 patients with atypical chronic myeloid leukemia (aCML; n = 30), JAK2 mutation–negative myelofibrosis (MF; n = 18), or JAK2 mutation–negative polycythemia vera (PV; n = 10). Stretches of homozygous, copy neutral SNP calls greater than 20Mb were seen in 10 (33%) aCML and 1 (6%) MF, but were absent in PV. In total, 7 different chromosomes were involved with 7q and 11q each affected in 10% of aCML cases. CBL mutations were identified in all 3 cases with 11q aUPD and analysis of 574 additional MPNs revealed a total of 27 CBL variants in 26 patients with aCML, myelofibrosis or chronic myelomonocytic leukemia. Most variants were missense substitutions in the RING

In [8]:
word_tokens = word_tokenize(txt1)
word_tokens = np.array(word_tokens)

In [9]:
print('initial length %d' % len(word_tokens))

initial length 6396


<i>The stemming step below was tried, but it handled gene names poorly, so it is not used.</i><br>
stemmer = PorterStemmer()<br>
for i in range(len(word_tokens)):<br>
    word_tokens[i] = stemmer.stem(word_tokens[i])

In [10]:
stop_words = set(stopwords.words('english'))
txt1_words = [w for w in word_tokens if w not in stop_words]
print('After removing stop words %d' % len(txt1_words))

After removing stop words 4627


In [11]:
df1 = pd.DataFrame(txt1_words)
df1.columns = ['tokens']
df1.head()

Unnamed: 0,tokens
0,Recent
1,evidence
2,demonstrated
3,acquired
4,uniparental


In [12]:
gene_ish_pattern = r"[A-Z]{2,7}"

In [13]:
# get gene-ish words in a simple list
gene_ish_words1 = [word for word in txt1_words if re.match(gene_ish_pattern, word)]

In [14]:
len(gene_ish_words1)

401

In [15]:
# Do the same with pd.DF
gene_ish_words = df1[df1['tokens'].str.match(gene_ish_pattern)]
print(len(gene_ish_words))

401
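
One detail worth keeping in mind: `re.match` (and `Series.str.match`) only anchors at the start of the token, so the pattern accepts anything that begins with two capital letters. The sketch below contrasts that with a stricter whole-token check. The `strict_pattern` is a hypothetical alternative, not something used in this notebook, and it would reject some legitimate gene names.

```python
import re

gene_ish_pattern = r"[A-Z]{2,7}"
# Hypothetical stricter variant: 2-7 capitals followed by at most two digits, nothing else
strict_pattern = r"[A-Z]{2,7}[0-9]{0,2}"

tokens = ["CBL", "JAK2", "ENSG00000110395", "FACSAria", "aUPD", "Recent"]

# Prefix match: only the first characters of the token have to fit the pattern
prefix_hits = [t for t in tokens if re.match(gene_ish_pattern, t)]
# Full match: the entire token has to fit the pattern
strict_hits = [t for t in tokens if re.fullmatch(strict_pattern, t)]

print(prefix_hits)  # ['CBL', 'JAK2', 'ENSG00000110395', 'FACSAria']
print(strict_hits)  # ['CBL', 'JAK2']
```

The prefix behaviour explains why tokens such as 'ENSG00000110395' and 'FACSAria' show up among the gene-ish words later on.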


In [16]:
gene_table = gene_ish_words.groupby('tokens').size().reset_index()
gene_table.columns = ['tokens', 'appearances']

In [17]:
gene_table.sort_values('appearances', ascending=False).head(15)

Unnamed: 0,tokens,appearances
11,CBL,99
95,UPN,38
66,MPNs,21
39,FLT3,20
52,JAK2,15
58,MF,15
89,SNP,11
80,RING,9
31,DNA,8
77,PV,7


In [18]:
mutation_patterns = ['Truncation', 'Deletion', 'Promoter','Amplification', 'Epigenetic', 'Frame', 'Overexpression',
                     'Duplication', 'Insertion','Subtype', 'Fusion', 'Splice', 'Wildtype']

In [19]:
# Pass the list itself; wrapping it in another list would build a MultiIndex
mutation_table = pd.DataFrame(index=mutation_patterns)
mutation_table['appearances'] = 0

In [20]:
for pattern in mutation_patterns:
    appearance = len(df1[df1['tokens'].str.contains(pattern, case=False)])
    mutation_table.loc[pattern, 'appearances'] = appearance

In [21]:
mutation_table

Unnamed: 0,appearances
Truncation,0
Deletion,6
Promoter,0
Amplification,3
Epigenetic,0
Frame,0
Overexpression,4
Duplication,1
Insertion,0
Subtype,1


<a id='sec2'></a>
# Write function to get gene-ish words list and mutation type table (<a href='#sec0'>Back To Top</a>)

In [22]:
def process_text1(text, print_on=False):
    '''
    Tokenize the text, then drop punctuation tokens, purely numeric tokens, and English stop words
    
    INPUT:
    ======
    text : str
        The text to analyze
    print_on : bool
        If True, print the token counts before and after stop-word removal
    
    OUTPUT:
    =======
    words : list
        The remaining tokens, in their original order
        
    '''
    # Tokenize the text
    word_tokens = word_tokenize(text)
    
    # Remove punctuation tokens (hyphens are kept) and purely numeric tokens
    remove_list = ['.', ',', '(', ')', '[', ']', '=', '+', '>', '<', ':', ';', '%']
    word_tokens = [word for word in word_tokens if word not in remove_list]
    word_tokens = [word for word in word_tokens if not word.isnumeric()]
    
    # Remove stop words (the NLTK list is lowercase, so capitalized forms such as 'To' are kept)
    stop_words = set(stopwords.words('english'))
    words = [w for w in word_tokens if w not in stop_words]
    
    # print if print_on=True
    if print_on:
        print('Length Before removing stop words %d' % len(word_tokens))
        print('Length After removing stop words %d' % len(words))
    
    return words

In [23]:
# Check if it works
txt2 = process_text1(txt1)
txt2

['Recent',
 'evidence',
 'demonstrated',
 'acquired',
 'uniparental',
 'disomy',
 'aUPD',
 'novel',
 'mechanism',
 'pathogenetic',
 'mutations',
 'cancer',
 'may',
 'reduced',
 'homozygosity',
 'To',
 'help',
 'identify',
 'novel',
 'mutations',
 'myeloproliferative',
 'neoplasms',
 'MPNs',
 'performed',
 'genome-wide',
 'single',
 'nucleotide',
 'polymorphism',
 'SNP',
 'screen',
 'identify',
 'aUPD',
 'patients',
 'atypical',
 'chronic',
 'myeloid',
 'leukemia',
 'aCML',
 'n',
 'JAK2',
 'mutation–negative',
 'myelofibrosis',
 'MF',
 'n',
 'JAK2',
 'mutation–negative',
 'polycythemia',
 'vera',
 'PV',
 'n',
 'Stretches',
 'homozygous',
 'copy',
 'neutral',
 'SNP',
 'calls',
 'greater',
 '20Mb',
 'seen',
 'aCML',
 'MF',
 'absent',
 'PV',
 'In',
 'total',
 'different',
 'chromosomes',
 'involved',
 '7q',
 '11q',
 'affected',
 'aCML',
 'cases',
 'CBL',
 'mutations',
 'identified',
 'cases',
 '11q',
 'aUPD',
 'analysis',
 'additional',
 'MPNs',
 'revealed',
 'total',
 'CBL',
 'variants',


In [24]:
def get_gene_like_words(tokenized_text, gene_list=None):
    '''
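    Get gene-name-like words from a list of tokenized words
    
    INPUT:
    ======
    tokenized_text : list
        A list of tokenized words
    gene_list : list, optional
        Known gene names; any matched word that contains one of them is replaced by that gene name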
    
    OUTPUT:
    =======
    gene_like_words : list
        A list of gene name like words in the tokenized list
    '''
    gene_ish_pattern = r"[A-Z]{2,7}"
    gene_like_words = [word for word in tokenized_text if re.match(gene_ish_pattern, word)]
    
    if gene_list is not None:
        genes = gene_list
        for gene in genes:
            for i in range(len(gene_like_words)):
                if gene in gene_like_words[i]:
                    gene_like_words[i] = gene
    
    return gene_like_words
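
The `gene_list` branch replaces a word whenever a gene name occurs anywhere inside it. The sketch below mirrors that loop in a standalone helper (`normalize_to_genes` is a name made up for this illustration) with a hypothetical gene list, to show both the intended effect and a side effect of substring matching.

```python
def normalize_to_genes(words, genes):
    """Standalone copy of the substring replacement in get_gene_like_words (illustration only)."""
    words = list(words)
    for gene in genes:
        for i, word in enumerate(words):
            if gene in word:       # substring test, not an exact match
                words[i] = gene
    return words

# Hypothetical inputs: primer-like names collapse to the gene, but 'ARAF' collapses to 'AR' too
result = normalize_to_genes(
    ["CBL_i7f", "FLT3-specific", "ARAF", "DNA"],
    ["CBL", "FLT3", "AR"],
)
print(result)  # ['CBL', 'FLT3', 'AR', 'DNA']
```

Short gene names are the risky ones here, since they are more likely to occur inside unrelated tokens. The nested loop also scales with the number of genes times the number of words, which may account for part of the run time seen later.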

In [25]:
glike_words = get_gene_like_words(txt2)
glike_words

['MPNs',
 'SNP',
 'JAK2',
 'MF',
 'JAK2',
 'PV',
 'SNP',
 'MF',
 'PV',
 'CBL',
 'MPNs',
 'CBL',
 'RING',
 'CBL',
 'FLT3',
 'CBL',
 'MPNs',
 'MPNs',
 'MPNs',
 'MPNs',
 'PV',
 'ET',
 'MF',
 'CML',
 'MPNs',
 'BCR-ABL',
 'CML',
 'MPNs',
 'BCR-ABL',
 'CML4',
 'JAK2',
 'PV',
 'ET',
 'MF,5⇓⇓–8',
 'MPNs',
 'JAK2',
 'FLT3.2',
 'MPL',
 'NRAS',
 'MPNs',
 'ET',
 'MF',
 'JAK2',
 'PV',
 'LOH',
 'DNA',
 'PV',
 'MPNs',
 'MPN',
 'SNP',
 'DNA',
 'RZPD',
 'GTYPE',
 'CNAT',
 'SNPs',
 'HRM',
 'PCR',
 'ABI',
 'PA',
 'CBL',
 'ATG',
 'ENSG00000110395',
 'CBL',
 'CBL_i7f',
 'CBL_i8r',
 'CBL',
 'CBL_i8f',
 'CBL_i9r',
 'CBL',
 'PCR',
 'RT-PCR',
 'RNA',
 'CBLe7F',
 'CBLe10R',
 'DNA',
 'CBL',
 'MLPA',
 'MRC',
 'CBL',
 'GFP',
 'CBL',
 'CA',
 'IL-3–dependent',
 'RPMI',
 'FBS',
 'WEHI-3B',
 'WEHI',
 'CBL',
 'DNA',
 'RPMI',
 'DMEM',
 'FBS',
 'WEHI',
 'GFP',
 'BD',
 'FACSAria',
 'CA',
 'CBL',
 'MTS',
 'CA',
 'CBL',
 'CBL',
 'CBL',
 'WEHI',
 'FLT3',
 'FLT3-specific',
 'RIPA',
 'SDS',
 'HA-ubiquitin',
 'CA',
 'CBL',
 'FL

In [26]:
def create_mutation_words_table(tokenized_text, normed=False):
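    '''
    Count how often each mutation-type keyword appears in a list of
    tokenized words
    
    INPUT:
    ======
    tokenized_text : list
        a list of tokenized words
    normed : str or bool
        'mutation_types' divides each count by the total keyword count,
        'total_text' divides by the number of tokens, and any other value
        returns raw counts plus a 'Total' entry
    
    OUTPUT:
    =======
    table : dict
        maps each keyword to its count or normalized frequency
    '''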
    # List of words for mutation types
    mutation_patterns = ['truncation', 'deletion', 'promoter','amplification', 'epigenetic', 'frame', 'overexpression',
                     'duplication', 'insertion','subtype', 'fusion', 'splice', 'wildtype']
    
    appearances = []
    for pattern in mutation_patterns:
        appearance = len([word for word in tokenized_text if pattern in word.lower()])
        appearances.append(appearance)
    
    if normed == 'mutation_types':
        appearances = np.array(appearances)
        if np.sum(appearances) != 0:
            appearances = appearances / np.sum(appearances)
        table = dict(zip(mutation_patterns, appearances))
    elif normed == 'total_text':
        appearances = np.array(appearances)
        appearances = appearances / len(tokenized_text)
        table = dict(zip(mutation_patterns, appearances))
    else:
        table = dict(zip(mutation_patterns, appearances))
        table['Total'] = np.sum(appearances)
    
    return table

In [27]:
create_mutation_words_table(txt2, normed='mutation_types')

{'amplification': 0.16666666666666666,
 'deletion': 0.33333333333333331,
 'duplication': 0.055555555555555552,
 'epigenetic': 0.0,
 'frame': 0.0,
 'fusion': 0.16666666666666666,
 'insertion': 0.0,
 'overexpression': 0.22222222222222221,
 'promoter': 0.0,
 'splice': 0.0,
 'subtype': 0.055555555555555552,
 'truncation': 0.0,
 'wildtype': 0.0}
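
The values above follow directly from the raw counts for this text (3 amplification, 6 deletion, 1 duplication, 3 fusion, 4 overexpression, 1 subtype, 18 in total). A short sketch of the 'mutation_types' normalization on those counts, including the guard against an all-zero row:

```python
import numpy as np

# Non-zero raw keyword counts for row 3, as reported in the count table of this notebook
counts = {"amplification": 3, "deletion": 6, "duplication": 1,
          "fusion": 3, "overexpression": 4, "subtype": 1}

values = np.array(list(counts.values()), dtype=float)
total = values.sum()

# Divide by the keyword total only when it is non-zero, as the function does
shares = dict(zip(counts, values / total)) if total else dict(zip(counts, values))

print(total)               # 18.0
print(shares["deletion"])  # 6 / 18
```

With this normalization each row sums to 1 unless no keyword appears at all, in which case the row stays all zeros.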

<a id='sec3'></a>
# Compiling the entire text-ome - testing (<a href='#sec0'>Back To Top</a>)

In [28]:
txt3 = text_train.iloc[150, 1]

In [29]:
class_train.iloc[150, :]

ID                150
Gene             EGFR
Variation    EGFRvIII
Class               7
Name: 150, dtype: object

In [30]:
txt3

'Alterations ofthe EGFR gene occur frequently in human gliomas where the most common Is an jjj.f@.ntn@edeletion of exons 2â€”7from the extracel lular domain, resulting in a truncated mutant receptor (AEGFR or de 2-7 EGFR). We previously demonstrated that introduction of @.EGFRinto human US7MG giloblastoma cells (US7MG.@EGFR) conferred remark ably enhanced tumorigenlcity in vivo. Here, we show by cell-mixing ex periments that the enhanced tumorigenicity conferred by 4@EGFRis attributable to a growth advantage Intrinsic to cells expressing the mutant receptor. We analyzed the labeling Index of the proliferation markers 1(1-67 and bromodeoxyuridine and found that tumors derived from US7MGAEGFR cells had significantly higher labeling indexes than those of tumors derived from US7MG cells that were either naive, expressed kinase-deflcient mutants of @EGFR,or overexpressed exogenous wild type EGFR. We also utilized terminal deoxynudeotidyl transferase-medi ated nick end-labeling assays and sh

In [31]:
textome1 = txt1 + ' ' + txt3

In [32]:
tokens1 = process_text1(txt1)
tokens2 = process_text1(txt3)
tokens_agg = process_text1(textome1)

<b>Create Mutation Table</b>

In [33]:
mut_table1 = create_mutation_words_table(tokens1)
mut_table2 = create_mutation_words_table(tokens2)
mut_table_agg = create_mutation_words_table(tokens_agg)
mut_table = pd.DataFrame([mut_table1, mut_table2, mut_table_agg])

In [34]:
mut_table

Unnamed: 0,Total,amplification,deletion,duplication,epigenetic,frame,fusion,insertion,overexpression,promoter,splice,subtype,truncation,wildtype
0,18,3,6,1,0,0,3,0,4,0,0,1,0,0
1,124,37,26,2,0,10,3,2,32,1,3,5,3,0
2,142,40,32,3,0,10,6,2,36,1,3,6,3,0


In [35]:
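# normed must be one of the strings the function checks for; True would fall through to raw counts
mut_table1 = create_mutation_words_table(tokens1, normed='mutation_types')
mut_table2 = create_mutation_words_table(tokens2, normed='mutation_types')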
mut_table = pd.DataFrame([mut_table1, mut_table2])

In [36]:
mut_table

Unnamed: 0,amplification,deletion,duplication,epigenetic,frame,fusion,insertion,overexpression,promoter,splice,subtype,truncation,wildtype
0,0.166667,0.333333,0.055556,0.0,0.0,0.166667,0.0,0.222222,0.0,0.0,0.055556,0.0,0.0
1,0.298387,0.209677,0.016129,0.0,0.080645,0.024194,0.016129,0.258065,0.008065,0.024194,0.040323,0.024194,0.0


<b>Create a sparse matrix for gene-ish words space</b>

In [37]:
genes = list(class_train['Gene'].unique())

In [38]:
glike_words1 = get_gene_like_words(tokens1, gene_list=genes)
glike_words2 = get_gene_like_words(tokens2, gene_list=genes)
glike_words_agg = get_gene_like_words(tokens_agg, gene_list=genes)

In [39]:
from collections import Counter

In [40]:
c1 = dict(Counter(glike_words1))
c2 = dict(Counter(glike_words2))

In [41]:
gene_table = pd.DataFrame()

In [42]:
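# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
gene_table = pd.concat([gene_table, pd.DataFrame([c1])], ignore_index=True)

In [43]:
gene_table = pd.concat([gene_table, pd.DataFrame([c2])], ignore_index=True)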

In [44]:
gene_table

Unnamed: 0,ABI,AML,"AML,38",AML.15,"AML.15,23,27",AND,ATG,BCR-ABL,BD,BRAF,...,TUNEL-positive,UBI,US7MG,USA,UT,VEGF,WB,WHO,ZD1839,ZMD.82
0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,...,,,,,,,,,,
1,,,,,,3.0,,,1.0,,...,1.0,2.0,4.0,4.0,1.0,5.0,11.0,2.0,1.0,1.0
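
The table above is a dense DataFrame in which most cells are NaN. If an actual sparse matrix is wanted, one option is scikit-learn's `DictVectorizer`, which accepts the same list of count dictionaries. This is a sketch of an alternative, not what the notebook does, and the two token lists are made up.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# Hypothetical gene-like words for two texts
docs = [["CBL", "CBL", "JAK2"], ["EGFR", "CBL"]]
counts = [dict(Counter(doc)) for doc in docs]

vec = DictVectorizer(sparse=True)   # returns a SciPy sparse matrix
X_counts = vec.fit_transform(counts)

print(vec.vocabulary_)      # column index for each word
print(X_counts.toarray())   # dense view, only sensible for tiny examples
```

Missing words become implicit zeros rather than NaN, so no `fillna` step is needed afterwards.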


<a id='sec4'></a>
# Compiling the entire text-ome - full mutation table (<a href='#sec0'>Back To Top</a>)

In [45]:
text_train.head()

Unnamed: 0,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...
3,3,Recent evidence has demonstrated that acquired...
4,4,Oncogenic mutations in the monomeric Casitas B...


First approach: collect all the dictionaries in a list, then convert the list to a DataFrame in one step

In [46]:
%%time
mut_words_list = []
for i in range(len(text_train)):
    text = text_train.loc[i, 'Text']
    tokens = process_text1(text)
    mut_words = create_mutation_words_table(tokens, normed='mutation_types')
    mut_words_list.append(mut_words)

CPU times: user 3min 44s, sys: 64.3 ms, total: 3min 44s
Wall time: 3min 45s


In [47]:
full_mutation_table = pd.DataFrame(mut_words_list)

In [48]:
full_mutation_table

Unnamed: 0,amplification,deletion,duplication,epigenetic,frame,fusion,insertion,overexpression,promoter,splice,subtype,truncation,wildtype
0,0.000000,0.527778,0.027778,0.000000,0.027778,0.166667,0.027778,0.055556,0.000000,0.138889,0.000000,0.027778,0.000000
1,0.000000,0.666667,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667,0.166667,0.000000,0.000000
2,0.000000,0.666667,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667,0.166667,0.000000,0.000000
3,0.166667,0.333333,0.055556,0.000000,0.000000,0.166667,0.000000,0.222222,0.000000,0.000000,0.055556,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,0.020833,0.458333,0.010417,0.000000,0.104167,0.062500,0.020833,0.041667,0.000000,0.135417,0.145833,0.000000,0.000000
8,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


Alternative: start with an empty DataFrame and add one row per text

In [49]:
full_mutation_table2 = pd.DataFrame()

In [50]:
%%time
for i in range(len(text_train)):
    text = text_train.loc[i, 'Text']
    tokens = process_text1(text)
    mut_words = create_mutation_words_table(tokens, normed='mutation_types')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    full_mutation_table2 = pd.concat([full_mutation_table2, pd.DataFrame([mut_words])], ignore_index=True)

CPU times: user 3min 42s, sys: 61.9 ms, total: 3min 42s
Wall time: 3min 42s


In [51]:
full_mutation_table2

Unnamed: 0,amplification,deletion,duplication,epigenetic,frame,fusion,insertion,overexpression,promoter,splice,subtype,truncation,wildtype
0,0.000000,0.527778,0.027778,0.000000,0.027778,0.166667,0.027778,0.055556,0.000000,0.138889,0.000000,0.027778,0.000000
1,0.000000,0.666667,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667,0.166667,0.000000,0.000000
2,0.000000,0.666667,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667,0.166667,0.000000,0.000000
3,0.166667,0.333333,0.055556,0.000000,0.000000,0.166667,0.000000,0.222222,0.000000,0.000000,0.055556,0.000000,0.000000
4,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,0.020833,0.458333,0.010417,0.000000,0.104167,0.062500,0.020833,0.041667,0.000000,0.135417,0.145833,0.000000,0.000000
8,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [52]:
full_mutation_table.equals(full_mutation_table2)

True

Both methods gave identical results and took about the same time (roughly 3 min 45 s each). The CPU seemed to run hotter with the row-by-row version, so I'll use the list-of-dictionaries method.
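
The equivalence of the two construction styles can be reproduced on a toy example. The rows below are made up, and the per-row frames are combined with `pd.concat`, which is one way to write the row-by-row variant in current pandas.

```python
import pandas as pd

# Hypothetical normalized rows, one dictionary per text
rows = [
    {"deletion": 0.5, "fusion": 0.5},
    {"deletion": 1.0, "fusion": 0.0},
]

# Style 1: hand the whole list to the DataFrame constructor
table_from_list = pd.DataFrame(rows)

# Style 2: one single-row frame per dictionary, concatenated at the end
frames = [pd.DataFrame([row]) for row in rows]
table_from_frames = pd.concat(frames, ignore_index=True)

print(table_from_list.equals(table_from_frames))
```

In this notebook the tokenization dominates the run time, which is probably why both styles took about equally long. Growing a DataFrame inside a loop copies the existing rows on every step, so the list-based style tends to scale better as the table gets larger.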

<a id='sec5'></a>
# Compiling the entire text-ome - full gene-like words table (<a href='#sec0'>Back To Top</a>)

In [53]:
text_train.head()

Unnamed: 0,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...
3,3,Recent evidence has demonstrated that acquired...
4,4,Oncogenic mutations in the monomeric Casitas B...


In [54]:
genes = list(class_train['Gene'].unique())

In [55]:
%%time
glike_words_list = []
for i in range(len(text_train)):
    text = text_train.loc[i, 'Text']
    tokens = process_text1(text)
    glike_words = get_gene_like_words(tokens, gene_list=genes)
    c = dict(Counter(glike_words))
    glike_words_list.append(c)

CPU times: user 3min 53s, sys: 74.9 ms, total: 3min 53s
Wall time: 3min 53s


In [56]:
glike_words_table = pd.DataFrame(glike_words_list)

In [57]:
glike_words_table

Unnamed: 0,AA,AA-3,AA-3'5,AA-30,AA-3555,AA-3V,AA-3′,AA-A00N,AA-N131Y,AA-V600EB-Raf,...,ZW3,ZWILCH,ZWINT,ZYX,ZZ,ZZ-TAZ2,ZZ-type,ZZO,ZZQ,ZZZQ
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


<a id='sec6'></a>
# Compiling the entire gene-ome - full gene table (not genome) (<a href='#sec0'>Back To Top</a>)
- This is NOT the gene-like words from the text
- This shows which gene is annotated for each ID in the 'variants' file

In [58]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [59]:
X_gene = np.array(class_train.Gene)
X_gene_int = LabelEncoder().fit_transform(X_gene.ravel()).reshape(-1, 1)
X_gene_bin = OneHotEncoder().fit_transform(X_gene_int).toarray()

In [60]:
X_gene_int

array([[ 85],
       [ 39],
       [ 39],
       ..., 
       [221],
       [221],
       [221]])

In [61]:
full_gene_table = pd.DataFrame(X_gene_bin)

In [62]:
full_gene_table

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,254,255,256,257,258,259,260,261,262,263
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
full_gene_table.loc[:, 39].head(10)

0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
Name: 39, dtype: float64
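
Column 39 only makes sense together with the `LabelEncoder` mapping (39 is the code assigned to CBL above). As a possible alternative, `pd.get_dummies` produces the same kind of one-hot table with the gene names kept in the column labels. The sketch uses the first five genes shown in `class_train.head()`; the `gene_` prefix is an arbitrary choice.

```python
import pandas as pd

# Genes of the first five rows, as shown in class_train.head()
genes = pd.Series(["FAM58A", "CBL", "CBL", "CBL", "CBL"], name="Gene")

# One column per gene, labeled with the gene name instead of an integer code
gene_dummies = pd.get_dummies(genes, prefix="gene", dtype=float)

print(gene_dummies)
```

Named columns also make it easier to read feature importances from a tree model later on.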

<a id='sec7'></a>
# Convert Mutation_Types in Class file (<a href='#sec0'>Back To Top</a>)
- Import convert_mutation_type
- Use the label encoding to make it a sparse matrix

In [64]:
def convert_mutation_type(data):
    '''
    Derive a 'mutation_type' column from the 'Variation' column and return the data with the new column
    (the column is also added to the input DataFrame in place)

    Input
    =====
    data : DataFrame
        The train or test data containing the 'Variation' column

    Output
    ======
    data : DataFrame
        'mutation_type' is added to the original data from the input
    '''
    # Copy the Variation into a new column (this could be just an empty copy with Nones)
    data['mutation_type'] = data['Variation']

    # Define regex pattern for point mutants
    point_mutation_pattern = \
        r"[ARNDCEQGHILKMFPSTWYV]{1}[0-9]{1,4}[ARNDCEQGHILKMFPSTWYV*]?$"

    # Define new mutation types
    major_types = ['Truncation', 'Point Mutation', 'Deletion', 'Promoter Mutations',
       'Amplification', 'Epigenetic', 'Frame Shift', 'Overexpression',
       'Deletion-Insertion', 'Duplication', 'Insertion',
       'Gene Subtype', 'Fusion', 'Splice', 'Copy Number Loss', 'Wildtype']

    # Convert the Variant information to mutation types
    data.loc[(data['Variation'].str.match(point_mutation_pattern)), 'mutation_type']= 'Point Mutation'
    data.loc[(data['Variation'].str.contains('missense', case=False)), 'mutation_type']= 'Point Mutation'
    data.loc[(data['Variation'].str.contains('fusion', case=False)), 'mutation_type']= 'Fusion'
    data.loc[(data['Variation'].str.contains('deletion', case=False)), 'mutation_type']= 'Deletion'
    data.loc[((data['Variation'].str.contains('del', case=False))\
            &(data['Variation'].str.contains('delins', case=False) == False)),
            'mutation_type']= 'Deletion'
    data.loc[((data['Variation'].str.contains('ins', case=False))\
            &(data['Variation'].str.contains('delins', case=False) == False)),
            'mutation_type']= 'Insertion'
    data.loc[((data['Variation'].str.contains('del', case=False))\
            &(data['Variation'].str.contains('delins', case=False))),
            'mutation_type']= 'Deletion-Insertion'
    data.loc[(data['Variation'].str.contains('dup', case=False)), 'mutation_type']= 'Duplication'
    data.loc[(data['Variation'].str.contains('trunc', case=False)), 'mutation_type']= 'Truncation'
    data.loc[(data['Variation'].str.contains('fs', case=False)), 'mutation_type']= 'Frame Shift'
    data.loc[(data['Variation'].str.contains('splice', case=False)), 'mutation_type']= 'Splice'
    data.loc[(data['Variation'].str.contains('exon', case=False)), 'mutation_type']= 'Point Mutation'
    data.loc[((data['Variation'].str.contains('EGFR', case=False))\
            |(data['Variation'].str.contains('AR', case=True))\
            |(data['Variation'].str.contains('MYC-nick', case=True))\
            |(data['Variation'].str.contains('TGFBR1', case=True))\
            |(data['Variation'].str.contains('CASP8L', case=True))),
            'mutation_type']= 'Gene Subtype'
    data.loc[((data['Variation'].str.contains('Hypermethylation', case=False))\
            |(data['Variation'].str.contains('Epigenetic', case=False))),
             'mutation_type']= 'Epigenetic'
    data.loc[(data['mutation_type'].isin(major_types) == False),
            'mutation_type']= 'Others'

    # rearrange order of columns
    if 'Class' in data.columns:
        data = data[['ID', 'Gene', 'Variation', 'mutation_type', 'Class']]
    else:
        data = data[['ID', 'Gene', 'Variation', 'mutation_type']]

    return data
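
The point-mutation rule can be checked in isolation on the variations shown earlier in this notebook. The sketch applies the same regular expression with `re.match`, which is what `Series.str.match` uses under the hood.

```python
import re

point_mutation_pattern = r"[ARNDCEQGHILKMFPSTWYV]{1}[0-9]{1,4}[ARNDCEQGHILKMFPSTWYV*]?$"

# Variations taken from the class_train rows displayed above
variations = ["W802*", "Q249E", "N454D", "L399V", "Truncating Mutations", "EGFRvIII"]

# One amino-acid letter, a 1-4 digit position, then an optional amino acid or '*' (stop)
is_point = {v: bool(re.match(point_mutation_pattern, v)) for v in variations}

for variation, flag in is_point.items():
    print(variation, flag)
```

The four substitution-style entries match, while 'Truncating Mutations' and 'EGFRvIII' do not and are left for the keyword rules that follow in the function.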

In [65]:
new_table = convert_mutation_type(class_train)

In [66]:
X_mtype = np.array(new_table['mutation_type'])
X_mtype_int = LabelEncoder().fit_transform(X_mtype.ravel()).reshape(-1, 1)
X_mtype_bin = OneHotEncoder().fit_transform(X_mtype_int).toarray()

In [67]:
X_mtype_int

array([[15],
       [12],
       [12],
       ..., 
       [ 7],
       [12],
       [12]])

In [68]:
full_mtype_table = pd.DataFrame(X_mtype_bin)

In [69]:
full_mtype_table

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


<a id='sec8'></a>
# Combined All! (<a href='#sec0'>Back To Top</a>)

In [70]:
full_mutation_table = full_mutation_table.fillna(value=0)
glike_words_table = glike_words_table.fillna(value=0)
full_gene_table = full_gene_table.fillna(value=0)
full_mtype_table = full_mtype_table.fillna(value=0)

In [71]:
features = pd.concat([full_mutation_table, 
                      glike_words_table,
                      full_gene_table,
                      full_mtype_table],
                      axis=1)

In [72]:
features.shape

(3321, 47977)
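
Both one-hot tables use integer column labels starting at 0, so the combined frame contains duplicate labels (0 to 16 appear twice). That does not affect the NumPy array used for modeling, but it makes column lookups ambiguous. A minimal sketch with two tiny stand-in tables, using `add_prefix` with prefixes chosen for this illustration:

```python
import pandas as pd

# Tiny stand-ins for full_gene_table and full_mtype_table (both labeled 0, 1)
gene_onehot = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])
mtype_onehot = pd.DataFrame([[0.0, 1.0], [1.0, 0.0]])

# Plain concatenation keeps the clashing integer labels
plain = pd.concat([gene_onehot, mtype_onehot], axis=1)

# Prefixing each block first gives every feature a unique name
labeled = pd.concat(
    [gene_onehot.add_prefix("gene_"), mtype_onehot.add_prefix("mtype_")],
    axis=1,
)

print(plain.columns.is_unique)   # False
print(list(labeled.columns))     # ['gene_0', 'gene_1', 'mtype_0', 'mtype_1']
```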

In [73]:
class_train.Class.shape

(3321,)

<a id='sec9'></a>
# Test with Random Forest (<a href='#sec0'>Back To Top</a>)

In [74]:
X = np.array(features).astype(float)
y = np.array(class_train.Class).astype(int).ravel()

In [75]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [77]:
%%time
rfc = RandomForestClassifier(n_estimators=50, max_depth=30)
rfc.fit(X_train, y_train)

CPU times: user 10 s, sys: 83.8 ms, total: 10.1 s
Wall time: 10.1 s


In [78]:
y_pred = rfc.predict(X_test)

In [79]:
print(accuracy_score(y_test, y_pred))

0.670676691729
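
An accuracy figure is easier to judge next to the score of always predicting the most frequent class. The sketch below computes that baseline with NumPy. The labels are hypothetical, so the printed number says nothing about this dataset; on the real data the same steps could be applied to `y_test`.

```python
import numpy as np

# Hypothetical class labels, for illustration only
y_demo = np.array([7, 7, 7, 4, 4, 1, 2, 7, 4, 1])

classes, counts = np.unique(y_demo, return_counts=True)
majority_class = classes[counts.argmax()]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts.max() / len(y_demo)

print(majority_class, baseline_accuracy)  # 7 0.4
```

Because the split above has no fixed `random_state`, the reported accuracy will also vary somewhat from run to run.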


<a id='sec10'></a>
# Test with Simple SVM (<a href='#sec0'>Back To Top</a>)

In [80]:
from sklearn.preprocessing import scale
from sklearn.svm import LinearSVC

In [81]:
X_scale = scale(X)

In [82]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_scale, y, test_size=0.2)

In [85]:
%%time
clf = LinearSVC()
clf.fit(X_train2, y_train2)

CPU times: user 31min 37s, sys: 282 ms, total: 31min 38s
Wall time: 31min 43s


In [86]:
y_pred2 = clf.predict(X_test2)

In [87]:
print(accuracy_score(y_test2, y_pred2))

0.470676691729
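
Here `scale(X)` is applied to the full matrix before the split, so the test rows influence the scaling statistics. One common way to avoid that is to put the scaler and the classifier in a pipeline that is fitted on the training part only. The sketch below uses synthetic data from `make_classification`; the sizes, the `max_iter` value, and the stratified split are choices made for this example, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature matrix and class labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler learns its mean and variance from the training rows only
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
model.fit(X_tr, y_tr)

demo_accuracy = model.score(X_te, y_te)
print(demo_accuracy)
```

Whether this changes the SVM result on the real features is an open question. The sketch only shows the mechanics of keeping the test rows out of the scaling step.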
