<a id='sec0'></a>
# Text Analysis
- Importing Data
- <a href='#sec1'>Exemplary Text Analysis for Row3</a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sns.set_context("paper")
%matplotlib inline

<b>Importing train_text</b>

In [116]:
class_train = pd.read_csv('train_variants')
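# Raw string so the regex escapes reach pandas unchanged; a regex separator needs the Python parser.
text_train = pd.read_csv("train_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])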
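
The call above is easier to follow on a small in-memory sample. This is a minimal sketch, not the real file: the two rows are made up, and the `ID,Text` header line is an assumption based on `skiprows=1` and the column names used above.

```python
import io
import pandas as pd

# Made-up sample: one header line (assumed to be "ID,Text"),
# then rows where ID and text are joined by a literal "||".
sample = (
    "ID,Text\n"
    "0||Kinases regulate the cell cycle, among other things.\n"
    "1||Abstract Background lung cancer\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\|\|",          # regex for two literal pipe characters
    engine="python",      # regex separators need the Python parser
    header=None,
    skiprows=1,           # skip the header line, which does not use "||"
    names=["ID", "Text"],
)
print(df)
```

Because only `||` is treated as a separator, commas inside the text stay within the `Text` column.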

In [117]:
class_train.head()

,ID,Gene,Variation,Class
0,0,FAM58A,Truncating Mutations,1
1,1,CBL,W802*,2
2,2,CBL,Q249E,2
3,3,CBL,N454D,3
4,4,CBL,L399V,4


In [118]:
text_train.head()

,ID,Text
0,0,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,Abstract Background Non-small cell lung canc...
2,2,Abstract Background Non-small cell lung canc...
3,3,Recent evidence has demonstrated that acquired...
4,4,Oncogenic mutations in the monomeric Casitas B...


<a id='sec1'></a>
# Exemplary Text Analysis for Row3 (<a href='#sec0'>Back To Top</a>)

In [4]:
txt1 = text_train.iloc[3, 1]

In [120]:
class_train.iloc[3, :]

ID               3
Gene           CBL
Variation    N454D
Class            3
Name: 3, dtype: object

In [5]:
txt1

'Recent evidence has demonstrated that acquired uniparental disomy (aUPD) is a novel mechanism by which pathogenetic mutations in cancer may be reduced to homozygosity. To help identify novel mutations in myeloproliferative neoplasms (MPNs), we performed a genome-wide single nucleotide polymorphism (SNP) screen to identify aUPD in 58 patients with atypical chronic myeloid leukemia (aCML; n = 30), JAK2 mutation–negative myelofibrosis (MF; n = 18), or JAK2 mutation–negative polycythemia vera (PV; n = 10). Stretches of homozygous, copy neutral SNP calls greater than 20Mb were seen in 10 (33%) aCML and 1 (6%) MF, but were absent in PV. In total, 7 different chromosomes were involved with 7q and 11q each affected in 10% of aCML cases. CBL mutations were identified in all 3 cases with 11q aUPD and analysis of 574 additional MPNs revealed a total of 27 CBL variants in 26 patients with aCML, myelofibrosis or chronic myelomonocytic leukemia. Most variants were missense substitutions in the RING

In [39]:
word_tokens = word_tokenize(txt1)
word_tokens = np.array(word_tokens)

In [40]:
print('Initial token count: %d' % len(word_tokens))

Initial token count: 6396


<i>The stemming step below was tried but did not work well for gene names, so it is not applied.</i><br>
stemmer = PorterStemmer()<br>
for i in range(len(word_tokens)):<br>
    word_tokens[i] = stemmer.stem(word_tokens[i])
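
One possible way to keep stemming without touching gene names is to skip tokens that look like gene symbols. The sketch below is only an illustration under assumptions: `stem_except_gene_like` and `toy_stem` are hypothetical names, and `toy_stem` is a crude stand-in for a real stemmer such as `PorterStemmer().stem`.

```python
import re

# Same "gene-ish" idea used later in this notebook: token starts with 2+ capitals.
GENE_ISH = re.compile(r"[A-Z]{2,7}")

def stem_except_gene_like(tokens, stem):
    """Apply `stem` to every token except those that start like a gene symbol."""
    return [t if GENE_ISH.match(t) else stem(t) for t in tokens]

def toy_stem(word):
    # Stand-in stemmer for illustration only: lowercase and drop trailing "s".
    return word.lower().rstrip("s")

stemmed = stem_except_gene_like(["CBL", "mutations", "MPNs", "Variants"], toy_stem)
print(stemmed)
```

Passing the stemmer in as a callable keeps the sketch independent of NLTK; in the notebook, `stemmer.stem` could be passed instead of `toy_stem`.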

In [41]:
stop_words = set(stopwords.words('english'))
# Keep only tokens that are not English stop words (the comparison is case-sensitive).
txt1_words = [w for w in word_tokens if w not in stop_words]
print('After removing stop words %d' % len(txt1_words))

After removing stop words 4627
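
NLTK's English stop word list is lowercase, so the filter above keeps capitalized tokens such as "The". The sketch below compares the two behaviours on a made-up token list; the small `stop_words` set is a stand-in for `stopwords.words('english')`.

```python
# Tiny stand-in stop word set; the notebook uses stopwords.words('english').
stop_words = {"the", "in", "of", "were"}
tokens = ["The", "mutations", "were", "identified", "in", "CBL"]

# Case-sensitive filter, as in the cell above: "The" is kept.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing only for the comparison removes "The" but leaves "CBL" untouched.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only inside the membership test preserves the original casing of the surviving tokens, which the later gene-name pattern relies on.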


In [107]:
df1 = pd.DataFrame(txt1_words)
df1.columns = ['tokens']
df1.head()

,tokens
0,Recent
1,evidence
2,demonstrated
3,acquired
4,uniparental


In [108]:
gene_ish_pattern = r"[A-Z]{2,7}"

In [109]:
gene_ish_words = df1[df1['tokens'].str.match(gene_ish_pattern)]
print(len(gene_ish_words))

401
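
`Series.str.match` anchors the pattern only at the start of each token, like `re.match`, so `[A-Z]{2,7}` selects any token that begins with at least two capitals. The sketch below, on made-up tokens, contrasts this with a full-token match.

```python
import re

gene_ish_pattern = r"[A-Z]{2,7}"
tokens = ["CBL", "FLT3", "MPNs", "DNA", "Recent", "aUPD", "N454D"]

# Prefix match (what str.match does): the token must start with 2+ capitals.
prefix_hits = [t for t in tokens if re.match(gene_ish_pattern, t)]

# Full match: the whole token must consist of 2 to 7 capitals.
full_hits = [t for t in tokens if re.fullmatch(gene_ish_pattern, t)]

print(prefix_hits)
print(full_hits)
```

The prefix match is the more permissive choice: it keeps symbols with digits such as FLT3, but it also admits plurals and general abbreviations such as MPNs and DNA.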


In [114]:
gene_table = gene_ish_words.groupby('tokens').size().reset_index()
gene_table.columns = ['tokens', 'appearances']

In [124]:
gene_table.sort_values('appearances', ascending=False).head(15)

,tokens,appearances
11,CBL,99
95,UPN,38
66,MPNs,21
39,FLT3,20
52,JAK2,15
58,MF,15
89,SNP,11
80,RING,9
31,DNA,8
77,PV,7
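
The group, count, and sort steps above can also be written with `value_counts`, which returns counts already sorted in descending order. This is a sketch on a made-up token column, not the notebook's data.

```python
import pandas as pd

# Made-up stand-in for the gene_ish_words frame built above.
gene_ish_words = pd.DataFrame({"tokens": ["CBL", "UPN", "CBL", "JAK2", "CBL", "UPN"]})

gene_table = (
    gene_ish_words["tokens"]
    .value_counts()                      # counts per token, largest first
    .rename_axis("tokens")               # name the index so it becomes a column
    .reset_index(name="appearances")     # back to a two-column DataFrame
)
print(gene_table)
```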


In [86]:
mutation_patterns = ['Truncation', 'Deletion', 'Promoter','Amplification', 'Epigenetic', 'Frame', 'Overexpression',
                     'Duplication', 'Insertion','Subtype', 'Fusion', 'Splice', 'Wildtype']

In [87]:
# Pass the list itself; wrapping it in another list would build a one-level MultiIndex.
mutation_table = pd.DataFrame(index=mutation_patterns)
mutation_table['appearances'] = 0

In [88]:
for pattern in mutation_patterns:
    # df1's only column was renamed to 'tokens' above, so select it by that name.
    appearance = len(df1[df1['tokens'].str.contains(pattern, case=False)])
    mutation_table.loc[pattern, 'appearances'] = appearance

In [89]:
mutation_table

,appearances
Truncation,0
Deletion,6
Promoter,0
Amplification,3
Epigenetic,0
Frame,0
Overexpression,4
Duplication,1
Insertion,0
Subtype,1
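
The keyword counts are case-insensitive substring matches, so "Deletion" also counts tokens such as "deletions", and "Frame" counts "frameshift". The sketch below shows this on made-up tokens and a shortened pattern list, building the table in one step from a dictionary.

```python
import pandas as pd

# Made-up tokens and a shortened pattern list, for illustration only.
df1 = pd.DataFrame({"tokens": ["deletion", "Deletions", "amplification", "CBL", "frameshift"]})
mutation_patterns = ["Deletion", "Amplification", "Frame", "Fusion"]

# str.contains returns a boolean Series; summing it counts the matching tokens.
counts = {
    p: int(df1["tokens"].str.contains(p, case=False).sum())
    for p in mutation_patterns
}
mutation_table = pd.DataFrame.from_dict(counts, orient="index", columns=["appearances"])
print(mutation_table)
```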
