<a id='sec0'></a>
# Creating More Features Using Noun Chunks via spaCy

In the nlp13 notebooks, unigrams (single words) were used as features for classifying quantification methods. Where one or a few words clearly signal the quantification method used (e.g. silac, itraq), unigram features worked very well. However, they did not work well for other methods, including 'label free' and 'spectrum counting'. Here, I'm trying 'noun chunks' as features (bi- and tri-grams would be an alternative). The rationale is that phrases which lose their meaning when split into unigrams are retained with this approach.
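
To make the rationale concrete, here is a minimal sketch of how a phrase such as 'label free' disappears from a unigram feature set but survives once multi-word units are kept. The sentence and the `ngrams` helper are made up for illustration; plain n-grams stand in for noun chunks here, so that the example runs without a spaCy model.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical protocol sentence, not taken from the dataset
sentence = 'proteins were quantified by label free spectrum counting'
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# The phrase is split into 'label' and 'free' in the unigram features,
# but is present as a single feature among the bigrams.
print('label free' in unigrams)
print('label free' in bigrams)
```

Noun chunks aim for the same effect while keeping only phrases that the parser treats as noun phrases, rather than every adjacent word pair.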

1. <a href='#sec1'><b>Import Modules</b></a>
2. <a href='#sec2'><b>Import and Pre-process Data</b></a>
3. <a href='#sec3'><b>Extract Noun Chunks</b></a>
4. <a href='#sec4'><b>Create Dummies for Quant Types</b></a>
5. <a href='#sec5'><b>Create Dictionary and BoW & Tf-Idf Corpora</b></a>

<a id='sec1'></a>
## 1. Import Modules

In [1]:
import pandas as pd
import numpy as np
import pickle
import spacy
import re
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from nltk.corpus import stopwords
from gensim import corpora, models, matutils

<a id='sec2'></a>
## 2. Import and Pre-process Data
<a href='#sec0'>(Back to top)</a><br>

In [3]:
# Import DataFrame and drop rows that do not have texts
if True:
    df = pd.read_csv('base_data/pride_table.csv').astype(str)
    print('# rows initially:', len(df))
    
    # Replace 'Not available' and 'nan' strings with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
    df.loc[:, 'sample_protocol'] = df.loc[:, 'sample_protocol'].replace({'Not available': np.nan, 'nan': np.nan})
    df.loc[:, 'data_protocol'] = df.loc[:, 'data_protocol'].replace({'Not available': np.nan, 'nan': np.nan})
    df.loc[:, 'description'] = df.loc[:, 'description'].replace({'Not available': np.nan, 'nan': np.nan})

    # Drop rows that have null text fields
    df.dropna(subset=['sample_protocol', 'data_protocol', 'description'], inplace=True)
    print('# rows after dropna:', len(df))

# rows initially: 5496
# rows after dropna: 4390


<a id='sec3'></a>
## 3. Extract Noun Chunks
<a href='#sec0'>(Back to top)</a><br>
1. <a href='#sec3-1'>Create and test extraction method</a><br>
2. <a href='#sec3-2'>Create DataFrame with Noun Chunks</a><br>

<a id='sec3-1'></a><b>1. Create and test extraction method</b>

In [4]:
nlp = spacy.load('en_core_web_lg')

In [5]:
# Function to extract noun chunks from a document
# * This function will need some tuning later for processing text
# * Ex1: replace hyphens with a space
# * Ex2: remove words like 'which'
def extract_noun_chuncks(text):
    pattern = r'^a\s|^an\s|^the\s|^each\s|^only\s|\.$|^\(|\)$| analysis$| analyses$|,'    # Patterns to drop
    mz_pattern1 = r'^/z'    # Regex for fixing 'm/z' cases: chunks that start with '/z'
    mz_pattern2 = r' m$'    # Regex for fixing 'm/z' cases: chunks that end with ' m'
    
    doc = nlp(text)    # Create nlp doc object
    noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks]       # Get lower-cased noun chunks
    noun_chunks = [re.sub(pattern, '', chunk) for chunk in noun_chunks]   # Remove some patterns
    noun_chunks = [chunk for chunk in noun_chunks if not re.match(mz_pattern1, chunk)]    # Remove '/z' strings
    noun_chunks = [re.sub(mz_pattern2, ' m/z', chunk) for chunk in noun_chunks]           # Turn a trailing ' m' into ' m/z'
    noun_chunks = [chunk.strip() for chunk in noun_chunks]                # Trim surrounding whitespace
    
    return noun_chunks
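
The regex clean-up in the function above can be tried in isolation, without loading a spaCy model. The sketch below repeats the same three patterns and applies them to hand-written strings that imitate raw noun chunks. `clean_chunks` and the example strings are hypothetical, and the assumption that 'm/z' reaches this step split into a chunk ending in ' m' plus a stray '/z' chunk is inferred from the patterns, not verified against the parser.

```python
import re

# Same patterns as in extract_noun_chuncks
DROP_PATTERN = r'^a\s|^an\s|^the\s|^each\s|^only\s|\.$|^\(|\)$| analysis$| analyses$|,'
MZ_STRAY = r'^/z'       # chunks that start with '/z'
MZ_CUT_OFF = r' m$'     # chunks that end with ' m'

def clean_chunks(raw_chunks):
    """Apply the post-processing steps to a list of raw chunk strings."""
    chunks = [c.lower() for c in raw_chunks]
    chunks = [re.sub(DROP_PATTERN, '', c) for c in chunks]
    chunks = [c for c in chunks if not re.match(MZ_STRAY, c)]
    chunks = [re.sub(MZ_CUT_OFF, ' m/z', c) for c in chunks]
    return [c.strip() for c in chunks]

# Hand-written stand-ins for what doc.noun_chunks might return
raw = ['The cell lysates', 'data analysis', '(Sigma-Aldrich)',
       '350 to 1500 m', '/z', 'an empty vector']
cleaned = clean_chunks(raw)
print(cleaned)
```

Note that the drop pattern strips a trailing ' analysis', so 'data analysis' is reduced to 'data'. That may or may not be desirable once the chunks are used as features.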

In [6]:
test_text = df.sample_protocol.loc[139]
test_text

'Generation of mDia2-Based Immunocomplexes and Mass Spectrometry 293T cells were transfected with Flag-tagged full-length mDia2 (either the wild type or the MA mutant) or empty vector. Cell lysates were prepared as previously described (7, 19). One and a half milligrams of cell lysates were immunoprecipitated using anti-Flag M2® Affinity gel (Sigma-Aldrich) for 2 hours at 4°C. Beads were washed three times in NET buffer (50 mM Tris-HCl pH 7.6, 150 mM NaCl, 5 mM EDTA and 0.1% Triton X-100) supplemented with protease inhibitor cocktail (Roche), 5 mM NaF and 1 mM NaVO4. Proteins were eluted with Laemmli buffer and separated by SDS-PAGE (NuPage 4-12% Bis-Tris gradient gel (Invitrogen)). The gel was fixed and stained with Colloidal Blue according to manufacturer’s instructions (Invitrogen).   Mass Spectrometry Protein reduction and alkylation was performed in gel with DTT (56°C, 1 h) and 2-chloro-iodoacetamide (dark, RT, 30 min), respectively, after which digestion was performed with trypsi

In [7]:
sorted(extract_noun_chuncks(test_text))

['0%',
 '0.1%',
 '0.1m acetic acid',
 '0.1m acetic acid',
 '1 h',
 '1 min',
 '1 mm navo4',
 '10 min',
 '10 min',
 '10 min solvent a',
 '100% acn',
 '100% solvent b',
 '13-28% solvent b',
 '150 mm nacl',
 '180 m/z',
 '2 hours',
 '2-chloro-iodoacetamide',
 '20 min',
 '20 mm',
 '25 s',
 '28-50% solvent b',
 '3 min',
 '30 min',
 '35%',
 '350 to 1500 m/z',
 '350 to 1500 m/z',
 '37°c',
 '4-12% bis-tris gradient gel',
 '400 mm',
 '445.120025 ion',
 '45 min',
 '4°c',
 '5 mm edta',
 '5 mm naf',
 '5 µl/min',
 '50 mm tris-hcl ph',
 '50 nl/min',
 '50-100% solvent b',
 '500 counts',
 '56°c',
 '60 min lc method',
 '80% acetonitrile',
 '90 min lc method',
 '90 min lc method',
 'affinity gel',
 'agc',
 'agc',
 'agc',
 'agc',
 'agilent 1200 hplc system',
 'alkylation',
 'ammerbuch-entringen',
 'beads',
 'biosolve',
 'both cid',
 'cell lysates',
 'cell lysates',
 'charge state',
 'charge state screening',
 'colloidal blue',
 'data-dependent mode',
 'deionized water',
 'dr maisch',
 'dtt',
 'dynamic excl

<a id='sec3-2'></a><b>2. Create DataFrame with Noun Chunks</b>

In [8]:
%%time
# Extract noun chunks from the text fields
if True:
    print('Processing sample_protocol...')
    df.loc[:, 'sample_protocol'] = df.loc[:, 'sample_protocol'].apply(lambda x: extract_noun_chuncks(x))
    
    print('Processing data_protocol...')
    df.loc[:, 'data_protocol'] = df.loc[:, 'data_protocol'].apply(lambda x: extract_noun_chuncks(x))
    
    print('Processing description...')
    df.loc[:, 'description'] = df.loc[:, 'description'].apply(lambda x: extract_noun_chuncks(x))

Processing sample_protocol...
Processing data_protocol...
Processing description...
CPU times: user 13min 35s, sys: 1min 43s, total: 15min 19s
Wall time: 10min 49s
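
The `apply` calls above run the spaCy pipeline on one text at a time. spaCy also provides `nlp.pipe`, which processes an iterable of texts in batches. Below is a minimal sketch of how the extraction could be batched, assuming `nlp` is the pipeline loaded earlier. `noun_chunks_batched` and the batch size are hypothetical, and whether this is actually faster on this dataset has not been measured.

```python
def noun_chunks_batched(texts, nlp, batch_size=64):
    """Return one list of lower-cased noun-chunk strings per input text.

    `nlp` is expected to be a loaded spaCy pipeline with a parser;
    `batch_size` is an arbitrary illustrative value.
    """
    results = []
    # nlp.pipe yields one Doc per input text, in input order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([chunk.text.lower() for chunk in doc.noun_chunks])
    return results
```

With the DataFrame above this would be called as `noun_chunks_batched(df['sample_protocol'], nlp)`. The regex clean-up from `extract_noun_chuncks` would still need to be applied to each returned list.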


In [9]:
df.head()

Unnamed: 0,dataset_id,sample_protocol,data_protocol,description,instruments,exp_types,quant_methods,labhead_fullname
8,PXD000011,"[crude membranes, 5 p56-p70 glun1tap/tap mouse...","[data-dependent, resolution, full ms spectrum,...","[tap-glun1, 840 kda, 1.5 mda, psd95-tap, 1.5 m...","LTQ Orbitrap, instrument model",Bottom-up proteomics,,Seth Grant
23,PXD000029,"[breast cancer tissue lysates, reduction, alky...","[proteomics data, proteome discoverer, fdr<0.0...","[current prognostic factors, precise risk-disc...",LTQ Orbitrap Velos,Shotgun proteomics,iTRAQ,Pavel Bouchal
31,PXD000041,"[gel, standard protocols, sds-page, washing, e...","[resulting spectra, mascot™, matrix science, s...","[hrp3_purification, schizosaccharomyces pombe ...","LTQ Orbitrap, instrument model",Bottom-up proteomics,,
50,PXD000063,"[conditioned media preparation, hplc-isobaric ...","[data analysis –, ab sciex, ms/ms spectra, pro...","[twenty million lbetat2 cells, either control,...",4800 Proteomics Analyzer,Bottom-up proteomics,,Stuart C. Sealfon
92,PXD000115,"[proteins, washed beads, 2-fold-concentrated s...","[raw files, software program maxquant, recalib...","[n-terminal protease npro, pestiviruses, innat...","LTQ Orbitrap, instrument model",Bottom-up proteomics,,Dr. Penny Powell


In [10]:
# Serialize and save the processed DF
if True:
    with open('nlp15_data/dfs/all_fields_nchunks_df.pickle', 'wb') as out_df:
        pickle.dump(df, out_df)

<a id='sec4'></a>
## 4. Create dummies for quant types
<a href='#sec0'>(Back to top)</a><br>

In [11]:
if True:
    with open('nlp15_data/dfs/all_fields_nchunks_df.pickle', 'rb') as infile_df:
        df = pickle.load(infile_df)

In [12]:
# Replace 'nan' strings with NaN, then drop rows with any missing value
df.replace({'nan': np.nan}, inplace=True)
df = df.dropna().reset_index(drop=True)

In [13]:
# .lower() all the quant methods
df.loc[:, 'quant_methods'] = df.quant_methods.str.lower()

# Figure out which quant methods are most popular
quant_methods_added_string = ','.join(list(df.quant_methods))
methods_strings = [method.strip() for method in quant_methods_added_string.split(',')]
methods_strings = pd.Series(methods_strings)

methods_strings.value_counts()[:10]

silac                                                   501
ms1 intensity based label-free quantification method    345
spectrum counting                                       328
tmt                                                     292
itraq                                                   274
label free                                              240
tic                                                     158
normalized spectral abundance factor - nsaf              96
peptide counting                                         48
empai                                                    33
dtype: int64
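
The counting step above joins all `quant_methods` strings, splits them on commas, and tallies the pieces. The same idea can be sketched with the standard library only. `count_methods` and the example strings are hypothetical and not taken from the dataset.

```python
from collections import Counter

def count_methods(quant_method_fields):
    """Tally individual methods from comma-separated method strings."""
    counts = Counter()
    for field in quant_method_fields:
        for method in field.lower().split(','):
            method = method.strip()
            if method:              # skip empty pieces such as a trailing comma
                counts[method] += 1
    return counts

# Hypothetical quant_methods values
fields = ['SILAC', 'TMT, Label free', 'label free,Spectrum counting', 'silac']
counts = count_methods(fields)
print(counts.most_common(3))
```

A dataset that lists several methods contributes one count to each of them, so the totals can add up to more than the number of rows.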

In [14]:
if True:
    df['silac'] = df.quant_methods.str.contains('silac').astype(int)
    df['ms1_label_free'] = df.quant_methods.str.contains('ms1 intensity based label-free quantification method').astype(int)
    df['spectrum_counting'] = df.quant_methods.str.contains('spectrum counting').astype(int)
    df['tmt'] = df.quant_methods.str.contains('tmt').astype(int)
    df['itraq'] = df.quant_methods.str.contains('itraq').astype(int)
    df['label_free'] = df.quant_methods.str.contains('label free').astype(int)
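
`str.contains` matches substrings (and treats the pattern as a regular expression by default), so each dummy above is 1 whenever the method name appears anywhere in the `quant_methods` string. An alternative is to split the string and compare whole method names. A minimal sketch of that variant follows; `method_dummies` and the example value are hypothetical.

```python
def method_dummies(quant_methods_field, methods):
    """Return a {method: 0/1} dict using exact matches on the split field."""
    listed = {m.strip() for m in quant_methods_field.lower().split(',')}
    return {name: int(name in listed) for name in methods}

methods = ['silac', 'tmt', 'itraq', 'label free']

# Hypothetical quant_methods value listing two methods
row = method_dummies('iTRAQ, Label free', methods)
print(row)
```

For the six method names used here the two approaches probably give the same result, since 'label free' (with a space) does not occur inside 'ms1 intensity based label-free quantification method' (with a hyphen). Exact matching mainly guards against short names such as 'tmt' matching inside longer strings.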

In [15]:
df.head(2)

Unnamed: 0,dataset_id,sample_protocol,data_protocol,description,instruments,exp_types,quant_methods,labhead_fullname,silac,ms1_label_free,spectrum_counting,tmt,itraq,label_free
0,PXD000029,"[breast cancer tissue lysates, reduction, alky...","[proteomics data, proteome discoverer, fdr<0.0...","[current prognostic factors, precise risk-disc...",LTQ Orbitrap Velos,Shotgun proteomics,itraq,Pavel Bouchal,0,0,0,0,1,0
1,PXD000164,"[protein extraction, catheter biofilm small pi...","[tryptic digest, reversed phase, rp) chromatog...","[term-catheterization, catheter-associated bac...",LTQ Orbitrap Velos,Shotgun proteomics,label free,Katharina Riedel,0,0,0,0,0,1


In [16]:
# Serialize and save the processed DF
if True:
    with open('nlp15_data/dfs/all_fields_nchunks_df_quant_dummies.pickle', 'wb') as out_df:
        pickle.dump(df, out_df)

<a id='sec5'></a>
## 5. Create Dictionary and BoW & Tf-Idf corpora
<a href='#sec0'>(Back to top)</a><br>

In [17]:
# Save the corpora
# This time I'll only build two corpora, as in nlp13: the combined protocols, and the combined protocols plus the description field
if True:
    # Protocols combined
    protocols_corpus = list(df.sample_protocol + df.data_protocol)
    with open('nlp15_data/corpora/protocols_corpus_nchunks.pickle', 'wb') as outfile:
        pickle.dump(protocols_corpus, outfile)

    # All combined
    whole_corpus = list(df.sample_protocol + df.data_protocol + df.description)
    with open('nlp15_data/corpora/whole_corpus_nchunks.pickle', 'wb') as outfile:
        pickle.dump(whole_corpus, outfile)

In [18]:
# Create and save dictionary using whole_corpus
if True:
    my_dictionary = corpora.Dictionary(whole_corpus)
    my_dictionary.save('nlp15_data/whole_dictionary_nchunks.dict')

In [19]:
# Save BoW and Tf-Idf
if True:
    # BoW transformations and save
    protocols_bow = [my_dictionary.doc2bow(text) for text in protocols_corpus]
    whole_bow = [my_dictionary.doc2bow(text) for text in whole_corpus]
    corpora.MmCorpus.serialize('nlp15_data/bow_and_tfidf/protocols_bow_nchunks.mm', protocols_bow)
    corpora.MmCorpus.serialize('nlp15_data/bow_and_tfidf/whole_bow_nchunks.mm', whole_bow)
    
    # Tf-Idf transformations  and save
    protocols_tfidf_model = models.TfidfModel(protocols_bow)
    protocols_tfidf = protocols_tfidf_model[protocols_bow]
    corpora.MmCorpus.serialize('nlp15_data/bow_and_tfidf/protocols_tfidf_nchunks.mm', protocols_tfidf)
    
    whole_tfidf_model = models.TfidfModel(whole_bow)
    whole_tfidf = whole_tfidf_model[whole_bow]
    corpora.MmCorpus.serialize('nlp15_data/bow_and_tfidf/whole_tfidf_nchunks.mm', whole_tfidf)
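
To show what these transformations do with noun-chunk features, here is a pure-Python sketch of the same steps: list columns are concatenated per row, each whole chunk becomes one vocabulary entry, and counts are re-weighted by tf-idf. This is not gensim's implementation. All function names and example chunks are hypothetical, the ids will not match those of `corpora.Dictionary`, and the weighting is the textbook `tf * log2(N / df)` with L2 normalisation, which I believe is close to gensim's default but should be treated as an assumption.

```python
import math
from collections import Counter

def build_vocab(corpus):
    """Map each distinct token (here: a whole noun chunk) to an integer id."""
    vocab = {}
    for doc in corpus:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

def tfidf(bows):
    """Weight counts by log2(N / df), drop zero weights, L2-normalise."""
    n_docs = len(bows)
    doc_freq = Counter(tid for bow in bows for tid, _ in bow)
    weighted = []
    for bow in bows:
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec if w > 0] if norm else [])
    return weighted

# Hypothetical chunk lists for two datasets
sample_protocol = [['cell lysates', 'sds-page'], ['cell lysates', 'trypsin']]
data_protocol = [['maxquant', 'cell lysates'], ['mascot']]

# Row-wise list concatenation, as df.sample_protocol + df.data_protocol does
corpus = [s + d for s, d in zip(sample_protocol, data_protocol)]
vocab = build_vocab(corpus)
bows = [to_bow(doc, vocab) for doc in corpus]
weights = tfidf(bows)
print(bows[0])
print(weights[0])
```

In this toy corpus 'cell lysates' occurs in both documents, so its idf is zero and it drops out of the tf-idf vectors, while chunks unique to one document keep their weight.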

In [20]:
len(df), len(protocols_tfidf), len(whole_tfidf)

(2387, 2387, 2387)

##### I think it's ok to end this notebook here and create another one for the actual analysis / classification procedure.