<a id='sec0'></a>

<b>List of feature sets</b>
1. '-positive' and '-negative'
2. 'gene(like)-gene(like)' fusion like words
3. mutation types (same as ones in variants file)
4. gene_like names
  - sub extraction with re.sub(r'-[A-Za-z]\*[0-9]\*\$')
5. races
6. bio-related words (very small as of now)
7. organs
8. tumor types : suffix '-oma(s)'
9. cell type : suffix '-cyte(s)', '-blast(s)', '-cyst(s)'
10. drugs : suffix '-nib' and '-mab'
11. enzymes : suffix '-ase(s)'
12. post-translational modifications (PTMs): prefix "^glyco|^phospho|^ubiquityl|^ubiquitinat|^acetyl|^methyl|^deamin|^oxydat"
13. biomedical conditions & adjectives 1 : suffix '-ia(s)', '-is' and '-ic'
14. biomedical conditions & adjectives 2 : prefix "^epiderm|^endothel|^onco|^hepato|^haemato|^osteo
     |^neuro|^cholecyst|^cyst|^encephal|^erythr|^gastr
     |^hist|^karyo|^kerat|^lymph|myel|^necr|^nephr|^sarco
     |^terato|^thorac|^trache|^vasculo|^hyper|^hypo"
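
The suffix and prefix rules above can be written as regular expressions. The following is a minimal sketch, not the code in `feature_engineering`: the `CATEGORY_PATTERNS` table and the `categorize` function are hypothetical names, and only a few of the listed categories are covered.

```python
import re

# Hypothetical pattern table covering a few of the categories listed above.
CATEGORY_PATTERNS = {
    "tumor_type": re.compile(r"omas?$"),
    "cell_type": re.compile(r"(cyte|blast|cyst)s?$"),
    "drug": re.compile(r"(nib|mab)$"),
    "enzyme": re.compile(r"ases?$"),
    "ptm": re.compile(r"^(glyco|phospho|ubiquityl|ubiquitinat|acetyl|methyl)"),
}

def categorize(word):
    """Return the names of all categories whose pattern matches the word."""
    w = word.lower()
    return [name for name, pat in CATEGORY_PATTERNS.items() if pat.search(w)]

print(categorize("melanoma"))         # ['tumor_type']
print(categorize("phosphorylation"))  # ['ptm']
print(categorize("kinase"))           # ['enzyme']
print(categorize("base"))             # ['enzyme'] (a false positive)
```

Purely suffix-based rules pick up ordinary words such as "base", which is why a manual removal list is still needed after extraction.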

<b> Approach 1: Extract all the feature words first, make a unique set, count frequencies</b>
1. Combine all text from each entry to make a whole_text (<a href='#sec1'>jump there!</a>)
2. Create a set of words with hyphens (<a href='#sec2'>jump there!</a>)
  - Tokenize the whole_text
  - Use functions to get '-positive/negative' and fusion-like words
3. Create sets of other words (<a href='#sec3'>jump there!</a>)
  - Replace non-relevant characters (including hyphens) in the whole_text with white space
  - Tokenize the whole_text_white
  - Get race words
  - Get drug words
  - Get protein_words(gene-like words, enzymes, and PTM words)
  - Get tissue type words (organ, tumor type, and cell type), combine them
  - Get biomedical conditions words
  - Get mutation type words
4. Combine all the sets, remove redundancies (<a href='#sec4'>jump there!</a>)
5. Iterate through each entry to count number of appearances for each feature (i.e. unique word in the non-redundant list) (<a href='#sec5'>jump there!</a>)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import feature_engineering as fe

from collections import Counter
from importlib import reload
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sns.set_context("paper")
%matplotlib inline

In [2]:
class_train = pd.read_csv('train_variants')
# Fields are separated by a literal "||"; a raw string avoids invalid-escape warnings
text_train = pd.read_csv("train_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])

<a id='sec1'></a>
## 1. Combine all text from each entry to make a whole_text 
(<a href='#sec0'>Back To Top</a>)

In [3]:
# Join all entries with a space so that the last word of one entry and the
# first word of the next are not fused into a single token
whole_text = ' '.join(text_train['Text'])

<a id='sec2'></a>
## 2. Create a set of words with hyphens (<a href='#sec0'>Back To Top</a>)

<b>Tokenize the whole_text</b>

In [4]:
%%time
# This time, retain all the characters to preserve hyphens
tokens = word_tokenize(whole_text)

CPU times: user 3min 8s, sys: 608 ms, total: 3min 9s
Wall time: 3min 9s


<b>Use functions to get '-positive/negative' and fusion-like words</b>

In [5]:
%%time
pos_neg_words = set(fe.get_positive_and_negative_words(tokens))

CPU times: user 6.05 s, sys: 6 µs, total: 6.05 s
Wall time: 6.05 s


In [6]:
%%time
fusion_like_words = set(fe.get_fusion_like_words(tokens))

CPU times: user 38.6 s, sys: 1.99 ms, total: 38.6 s
Wall time: 38.6 s


In [7]:
print('# of words with -positive and -negative: %d' % len(pos_neg_words))
print('# of fusion-like words: %d' % len(fusion_like_words))

# of words with -positive and -negative: 766
# of fusion-like words: 7713
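
The `fe` helpers used above are not shown in this notebook. As a rough illustration of how such hyphenated words could be picked out of a token list, here is a minimal sketch. The two patterns and both function names are assumptions for illustration and may differ from what `fe.get_positive_and_negative_words` and `fe.get_fusion_like_words` actually do.

```python
import re

# Assumed shapes: "<something>-positive" / "<something>-negative", and two
# gene-like symbols (uppercase letters and digits) joined by a hyphen.
POS_NEG_RE = re.compile(r"^\w[\w-]*-(positive|negative)$", re.IGNORECASE)
FUSION_RE = re.compile(r"^[A-Z][A-Z0-9]+-[A-Z][A-Z0-9]+$")

def pos_neg_words_sketch(tokens):
    """Return the unique tokens ending in -positive or -negative."""
    return {t for t in tokens if POS_NEG_RE.match(t)}

def fusion_like_words_sketch(tokens):
    """Return the unique tokens that look like GENE-GENE."""
    return {t for t in tokens if FUSION_RE.match(t)}

tokens = ["EGFR-positive", "BCR-ABL1", "HER2-negative",
          "well-known", "EML4-ALK", "positive"]
print(sorted(pos_neg_words_sketch(tokens)))      # ['EGFR-positive', 'HER2-negative']
print(sorted(fusion_like_words_sketch(tokens)))  # ['BCR-ABL1', 'EML4-ALK']
```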


<a id='sec3'></a>
## 3. Create sets of other words (<a href='#sec0'>Back To Top</a>)

<b>Replace non-relevant characters (including hyphens) in the whole_text with white space</b>

In [8]:
%%time
# Remove irrelevant characters by replacing them with whitespace
whole_text_white = fe.replace_with_whitespace(whole_text, hyphens='on')

# Tokenize the whole_text_white
tokens_white = word_tokenize(whole_text_white)

CPU times: user 1min 54s, sys: 4.15 s, total: 1min 58s
Wall time: 1min 58s
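
For readers without access to `feature_engineering`, the cleaning step can be pictured roughly as follows. This is a hypothetical stand-in, not the real `fe.replace_with_whitespace`: the function name, the `replace_hyphens` parameter, and the exact character set are all assumptions, and no claim is made about how the real `hyphens='on'` / `hyphens='off'` argument maps onto it.

```python
import re

def replace_with_whitespace_sketch(text, replace_hyphens=True):
    """Replace characters other than letters, digits and whitespace with a space.

    Hypothetical stand-in for fe.replace_with_whitespace. When replace_hyphens
    is False, hyphens are kept so that hyphenated words stay in one piece.
    """
    pattern = r"[^A-Za-z0-9\s]" if replace_hyphens else r"[^A-Za-z0-9\s-]"
    return re.sub(pattern, " ", text)

sample = "BCR-ABL1 (p210) is kinase-active."
print(replace_with_whitespace_sketch(sample).split())
# ['BCR', 'ABL1', 'p210', 'is', 'kinase', 'active']
print(replace_with_whitespace_sketch(sample, replace_hyphens=False).split())
# ['BCR-ABL1', 'p210', 'is', 'kinase-active']
```

Whichever convention is used, the same hyphen handling needs to be applied when extracting feature words and when counting them; otherwise a word extracted as one token may never match the tokens produced at counting time.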


In [9]:
%%time
# Get race words
race_words = set(fe.get_race_words(tokens_white))

#Get drug words
drug_words = set(fe.get_drug_words(tokens_white))

#Get protein_words(gene-like words, enzymes, and PTM words)
protein_words = set(fe.get_protein_words(tokens_white))

# Get tissue type words (organ, tumor type, and cell type), combine them
tissue_type_words = set(fe.get_tissue_type_words(tokens_white))

#Get biomedical conditions words
biomed_words = set(fe.get_biomedical_words(tokens_white))

#Get mutation type words
mutation_type_words = set(fe.get_mutation_type_words(tokens_white))

CPU times: user 5min 2s, sys: 61.9 ms, total: 5min 2s
Wall time: 5min 2s


In [10]:
# Get number of words obtained from each
print('# of race words: %d' % len(race_words))
print('# of drug words: %d' % len(drug_words))
print('# of protein words: %d' % len(protein_words))
print('# of tissue type words: %d' % len(tissue_type_words))
print('# of biomed words: %d' % len(biomed_words))
print('# of mutation type words: %d' % len(mutation_type_words))

# of race words: 10
# of drug words: 202
# of protein words: 19278
# of tissue type words: 467
# of biomed words: 4217
# of mutation type words: 13


<a id='sec4'></a>
## 4. Combine all the sets, remove redundancies (<a href='#sec0'>Back To Top</a>)

In [11]:
combined_words = list(race_words) + list(drug_words) + list(protein_words) \
                + list(tissue_type_words) + list(biomed_words) + list(mutation_type_words)
features = set(combined_words)

print('# of words all combined: %d' % len(combined_words))
print('# of unique words: %d' % len(features))

# of words all combined: 24187
# of unique words: 24065
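
The difference between the two counts (24187 - 24065 = 122) means some words were collected by more than one extractor. A small sketch of how those overlapping words could be listed follows; `words_in_multiple_sets` is a hypothetical helper and the data is a toy example, not the notebook's output.

```python
from collections import Counter

def words_in_multiple_sets(named_sets):
    """Map each word found in more than one set to the names of those sets."""
    counts = Counter(word for words in named_sets.values() for word in words)
    return {
        word: sorted(name for name, words in named_sets.items() if word in words)
        for word, n in counts.items() if n > 1
    }

# Toy data, not taken from the notebook's output.
toy_sets = {
    "protein": {"kinase", "braf", "lymphokine"},
    "biomed": {"lymphokine", "ischemia"},
    "drug": {"imatinib"},
}
print(words_in_multiple_sets(toy_sets))  # {'lymphokine': ['biomed', 'protein']}
```

In the notebook this could be called with a dict such as `{'race': race_words, 'drug': drug_words, ...}` to see which categories overlap.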


<b>Re-process the list to reduce the number of features</b>

In [12]:
additional_removal = ['base', 'case', 'please', 'this', 'with', 'within', 'at', 'in', 'for']

features1 = [re.sub(r'^[\W0-9]+', '', word) for word in features]
features1 = [word for word in features1 if word not in additional_removal]
features1 = sorted(list(set(features1)))
print(len(features1))

23959
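
To make the effect of the `re.sub(r'^[\W0-9]+', '', word)` step concrete, the sketch below applies the same pattern to a few made-up words. The wrapper name `strip_leading_noise` is hypothetical.

```python
import re

def strip_leading_noise(word):
    """Remove leading non-word characters and digits, as in the cell above."""
    return re.sub(r'^[\W0-9]+', '', word)

for w in ['-catenin', '5fu', 'p53', '123', '_x']:
    print(repr(w), '->', repr(strip_leading_noise(w)))
# '-catenin' -> 'catenin'
# '5fu' -> 'fu'
# 'p53' -> 'p53'
# '123' -> ''
# '_x' -> '_x'
```

Two side effects are worth keeping in mind: a purely numeric word collapses to the empty string, which stays in the feature list unless it is filtered out, and a word such as '5fu' is changed to 'fu', so it will no longer match the original token when frequencies are counted.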


In [13]:
# save the 1st pass feature names
features_pass1 = pd.Series(features1)
features_pass1.to_csv('features_pass1.csv', index=False)

<a id='sec5'></a>
## 5. Iterate through each entry to count number of appearances for each feature (i.e. unique word in the non-redundant list) (<a href='#sec0'>Back To Top</a>)

In [14]:
feature_matrix = pd.DataFrame(index=range(len(text_train)), columns=features1)

In [15]:
# Make sure df has the same number of rows as text_train
feature_matrix.shape[0] == text_train.shape[0]

True

Trial 1 to get frequencies

In [16]:
%%time
for i in range(10):#len(text_train)):
    # Prepare each text
    text = text_train.loc[i, 'Text']
    text_white = fe.replace_with_whitespace(text, hyphens='off')
    text_white = text_white.lower()
    
    # tokenize text
    tokens_white = word_tokenize(text_white)
    
    # Count with Counter
    c = dict(Counter(tokens_white))
    
    for word in features1:
        if word in c:
            feature_matrix.loc[i, word] = c[word]
        else:
            feature_matrix.loc[i, word] = 0

CPU times: user 35.6 s, sys: 53.9 ms, total: 35.7 s
Wall time: 35.7 s


Trial 2 to get frequencies

In [17]:
%%time
for i in range(10):#len(text_train)):
    # Prepare each text
    text = text_train.loc[i, 'Text']
    text_white = fe.replace_with_whitespace(text, hyphens='off')
    text_white = text_white.lower()
    
    # tokenize text
    tokens_white = word_tokenize(text_white)
    
    for word in features1:
        feature_matrix.loc[i, word] = len([token for token in tokens_white if word in token])

CPU times: user 1min 30s, sys: 2.01 ms, total: 1min 30s
Wall time: 1min 30s
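
Trial 1 and Trial 2 do not measure the same thing: Trial 1 counts tokens that are exactly equal to the feature word, while Trial 2 counts tokens that contain it as a substring. A toy example (made-up tokens) shows the difference:

```python
from collections import Counter

tokens = ['lymph', 'lymphoma', 'lymphocyte', 'node']
feature = 'lymph'

exact = Counter(tokens)[feature]                      # Trial 1 style: exact match
substring = len([t for t in tokens if feature in t])  # Trial 2 style: substring match
print(exact, substring)  # 1 3
```

Substring matching therefore gives higher counts for short feature words, in addition to being slower because every feature is compared against every token.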


Trial 3 to get frequencies

In [18]:
%%time
entry_list = []
for i in range(10):#len(text_train)):
    # Prepare each text
    text = text_train.loc[i, 'Text']
    text_white = fe.replace_with_whitespace(text, hyphens='off')
    text_white = text_white.lower()
    
    # tokenize text
    tokens_white = word_tokenize(text_white)
    
    # Count with Counter
    c = dict(Counter(tokens_white))
    d = {key:value for key, value in c.items() if key in features1}
    entry_list.append(d)
    
feature_matrix = pd.DataFrame(entry_list)

CPU times: user 5.88 s, sys: 995 µs, total: 5.88 s
Wall time: 5.89 s
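
Trial 3 is faster mainly because it builds one small dict per entry and creates the DataFrame once, instead of assigning cell by cell with `.loc`. The sketch below shows the same idea on toy token lists, using only the standard library. `count_features` is a hypothetical helper, and converting the feature list to a set is an additional assumption-free speed-up (set membership is constant time, whereas `key in features1` scans a list).

```python
from collections import Counter

def count_features(token_lists, feature_names):
    """Return one {feature: count} dict per document, skipping absent features."""
    feature_set = set(feature_names)  # fast membership test instead of scanning a list
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append({w: n for w, n in counts.items() if w in feature_set})
    return rows

docs = [['braf', 'mutation', 'braf', 'kinase'], ['egfr', 'the', 'kinase']]
rows = count_features(docs, ['braf', 'egfr', 'kinase'])
print(rows)  # [{'braf': 2, 'kinase': 1}, {'egfr': 1, 'kinase': 1}]
```

Passing such a list of dicts to `pd.DataFrame` yields one column per word that occurs in at least one entry, with NaN where a word is absent from a row; features that never occur get no column at all.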


Use the Trial 3 method (the fastest of the three) for the whole matrix

In [19]:
features_frequency_list = []
num_entries = len(text_train)
for i in range(num_entries):
    # Prepare each text
    text = text_train.loc[i, 'Text']
    text_white = fe.replace_with_whitespace(text, hyphens='off')
    text_white = text_white.lower()
    
    # tokenize text
    tokens_white = word_tokenize(text_white)
    
    # Count with Counter
    c = dict(Counter(tokens_white))
    d = {key:value for key, value in c.items() if key in features1}
    entry_list.append(d)  # BUG: should be features_frequency_list.append(d); fix and rerun
    
    if (i % 300) == 0:
        print('%d / %d complete' % (i, num_entries))
    
feature_matrix = pd.DataFrame(features_frequency_list)

0 / 3321 complete
300 / 3321 complete
600 / 3321 complete
900 / 3321 complete
1200 / 3321 complete
1500 / 3321 complete
1800 / 3321 complete
2100 / 3321 complete
2400 / 3321 complete
2700 / 3321 complete
3000 / 3321 complete
3300 / 3321 complete


In [45]:
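# The loop above appended to entry_list (which already held the 10 rows from
# Trial 3) instead of features_frequency_list, so rebuild the matrix from it
feature_matrix = pd.DataFrame(entry_list)
# Drop the 10 leftover Trial 3 rows, then renumber the rows from 0;
# reset_index is used because reindex would align on the old labels 10..3330
feature_matrix = feature_matrix.iloc[10:, :].reset_index(drop=True)
feature_matrix = feature_matrix.fillna(value=0)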
print(feature_matrix.shape)

(3321, 22190)


In [47]:
feature_matrix.to_csv('feature_matrix_pass1.csv', index=False)

Looks like some words were not picked up? The matrix has 22190 columns, but `features1` contains 23959 words.

In [51]:
import random
random_features = random.sample(list(feature_matrix.columns), 20)
feature_matrix[random_features].head(55)

Unnamed: 0,fancl,sequencinganalysis,mucositis,erf,gsp50,pcnt,epr3864,kiaa0130,ie,xscale,hsc6,gtrag,oha,gpr39,gk,anxiolytic,md,syk,anb,lymphogenic
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
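
One way to find out which feature words never received a column is a set difference between the feature list and the matrix columns. The sketch below uses toy values; `missing_features` is a hypothetical helper. Possible causes in this notebook (assumptions, not verified here) include texts being lower-cased before counting while the feature words may not be, and hyphens being handled differently during extraction and counting.

```python
def missing_features(feature_names, matrix_columns):
    """Return the features that never appear as a column, in sorted order."""
    return sorted(set(feature_names) - set(matrix_columns))

# Toy values; in the notebook these would be features1 and feature_matrix.columns.
features_toy = ['BRAF', 'egfr', 'kinase', 'T-cell']
columns_toy = ['egfr', 'kinase']
print(missing_features(features_toy, columns_toy))  # ['BRAF', 'T-cell']
```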
