<a id='sec0'></a>
# Visualization of datasets in the BoW and Tf-Idf space using the nlp13 corpora
The idea is to use PCA, and possibly tSNE, to reduce the dimensionality of the text features, represented as BoW and Tf-Idf vectors, and to visualize each dataset's document in 2D or 3D space. The points can be colored by different kinds of labels (e.g. experiment types, quantification methods, species, etc.).

1. <a href='#sec1'><b>Import Modules</b></a>
2. <a href='#sec2'><b>Import and Process Data</b></a>
3. <a href='#sec3'><b>Set up data processing functions</b></a>
4. <a href='#sec4'><b>Set up visualization functions</b></a>
5. <a href='#sec5'><b>Visualize in PCA space</b></a>
6. <a href='#sec6'><b>Visualize in tSNE space</b></a>

<a id='sec1'></a>
## Import Modules
<a href='#sec0'>(Back to top)</a><br>

In [1]:
import os
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from nltk.corpus import stopwords
import spacy
from gensim import corpora, models
from gensim import matutils
from nlp_utility import lemmatize_text

In [3]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

<a id='sec2'></a>
## Import and Process Data
<a href='#sec0'>(Back to top)</a><br>
1. <a href='#sec2-1'>Create DataFrames</a><br>
2. <a href='#sec2-2'>Create Dictionary and BoW & Tf-Idf corpora</a><br>

<a id='sec2-1'></a><b>1. Create DataFrames</b>

I need to create new corpora because I realized that 1) in nlp13, .dropna() was applied to remove rows whose 'quant_methods' was null, and 2) consequently the corpora were specific to the remaining subset.
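
As a toy illustration of the issue (the frame and its values below are invented, not taken from pride_table.csv), dropping rows on a label column before building a corpus ties the corpus to that label's subset, whereas dropping only on the text fields keeps every document that has text:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; all values are made up for illustration
toy = pd.DataFrame({
    'dataset_id': ['A', 'B', 'C', 'D'],
    'description': ['text a', 'text b', np.nan, 'text d'],
    'quant_methods': ['iTRAQ', np.nan, 'SILAC', np.nan],
})

# nlp13-style: dropping on the label column makes the corpus label-specific
by_label = toy.dropna(subset=['quant_methods'])

# Approach taken here: drop only the rows whose text is missing
by_text = toy.dropna(subset=['description'])

print(len(by_label), len(by_text))
```

Label-specific subsets can then be taken later, per label, without changing the corpus itself.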

In [4]:
# Import DataFrame
if False:
    df = pd.read_csv('base_data/pride_table.csv').astype(str)
    print('# rows:', len(df))

    # Treat placeholder strings as missing ('nan' is produced by .astype(str) above)
    df.loc[:, 'sample_protocol'] = df.loc[:, 'sample_protocol'].replace({'Not available': np.nan, 'nan': np.nan})
    df.loc[:, 'data_protocol'] = df.loc[:, 'data_protocol'].replace({'Not available': np.nan, 'nan': np.nan})
    df.loc[:, 'description'] = df.loc[:, 'description'].replace({'Not available': np.nan, 'nan': np.nan})

    # Drop rows that have null text fields
    df.dropna(subset=['sample_protocol', 'data_protocol', 'description'], inplace=True)
    print('# rows:', len(df))

# rows: 5496
# rows: 4390


In [5]:
%%time
# lemmatize text
if False:
    print('Processing sample_protocol...')
    df.loc[:, 'sample_protocol'] = df.loc[:, 'sample_protocol'].apply(lambda x: lemmatize_text(x))
    
    print('Processing data_protocol...')
    df.loc[:, 'data_protocol'] = df.loc[:, 'data_protocol'].apply(lambda x: lemmatize_text(x))
    
    print('Processing description...')
    df.loc[:, 'description'] = df.loc[:, 'description'].apply(lambda x: lemmatize_text(x))

Processing sample_protocol...
Processing data_protocol...
Processing description...
CPU times: user 50min 33s, sys: 47min 52s, total: 1h 38min 25s
Wall time: 12min 29s


In [6]:
# Create feature specific DFs and save
if False:
    inst_df = df[['dataset_id', 'sample_protocol', 'data_protocol', 'description', 'instruments']].dropna()
    exp_df = df[['dataset_id', 'sample_protocol', 'data_protocol', 'description', 'exp_types']].dropna()
    pi_df = df[['dataset_id', 'sample_protocol', 'data_protocol', 'description', 'labhead_fullname']].dropna()

    with open('nlp14_data/dfs/all_fields_df.pickle', 'wb') as out_df:
        pickle.dump(df, out_df)
    
    with open('nlp14_data/dfs/instruments_df.pickle', 'wb') as out_df:
        pickle.dump(inst_df, out_df)
    
    with open('nlp14_data/dfs/exp_types_df.pickle', 'wb') as out_df:
        pickle.dump(exp_df, out_df)
    
    with open('nlp14_data/dfs/pis_df.pickle', 'wb') as out_df:
        pickle.dump(pi_df, out_df)

In [7]:
df.head(3)

Unnamed: 0,dataset_id,sample_protocol,data_protocol,description,instruments,exp_types,quant_methods,labhead_fullname
8,PXD000011,"[the, crude, membrane, tap, mouse, forebrain, ...","[data, dependent, analysis, carry, use, resolu...","[tap, kda, mda, mda, wt, control, mda, native,...","LTQ Orbitrap, instrument model",Bottom-up proteomics,,Seth Grant
23,PXD000029,"[breast, cancer, tissue, lysate, reduction, al...","[proteomic, datum, analysis, proteome, discove...","[current, prognostic, factor, insufficient, pr...",LTQ Orbitrap Velos,Shotgun proteomics,iTRAQ,Pavel Bouchal
31,PXD000041,"[gel, digest, perform, describe, standard, pro...","[the, result, spectra, analyze, mascot, matrix...","[schizosaccharomyces, pombe, eukaryotic, genom...","LTQ Orbitrap, instrument model",Bottom-up proteomics,,


In [11]:
# Save the corpora
# This time I'll only use two corpora: the combined protocols, and the protocols combined with the description field
if False:
    # Protocols combined
    protocols_corpus = list(df.sample_protocol + df.data_protocol)
    with open('nlp14_data/corpora/protocols_corpus.pickle', 'wb') as outfile:
        pickle.dump(protocols_corpus, outfile)

    # All combined
    whole_corpus = list(df.sample_protocol + df.data_protocol + df.description)
    with open('nlp14_data/corpora/whole_corpus.pickle', 'wb') as outfile:
        pickle.dump(whole_corpus, outfile)

<a id='sec2-2'></a><b>2. Create Dictionary and BoW & Tf-Idf corpora</b>

In [12]:
# Load serialized corpora
if False:
    # Protocols combined
    with open('nlp14_data/corpora/protocols_corpus.pickle', 'rb') as infile:
        protocols_corpus = pickle.load(infile)

    # All combined
    with open('nlp14_data/corpora/whole_corpus.pickle', 'rb') as infile:
        whole_corpus = pickle.load(infile)

In [14]:
# Create and save dictionary using whole_corpus
if False:
    my_dictionary = corpora.Dictionary(whole_corpus)
    my_dictionary.save('nlp14_data/whole_dictionary.dict')

In [15]:
# Save BoW and Tf-Idf
if False:
    # BoW transformations and save
    protocols_bow = [my_dictionary.doc2bow(text) for text in protocols_corpus]
    whole_bow = [my_dictionary.doc2bow(text) for text in whole_corpus]
    corpora.MmCorpus.serialize('nlp14_data/bow_and_tfidf/protocols_bow.mm', protocols_bow)
    corpora.MmCorpus.serialize('nlp14_data/bow_and_tfidf/whole_bow.mm', whole_bow)
    
    # Tf-Idf transformations and save
    protocols_tfidf_model = models.TfidfModel(protocols_bow)
    protocols_tfidf = protocols_tfidf_model[protocols_bow]
    corpora.MmCorpus.serialize('nlp14_data/bow_and_tfidf/protocols_tfidf.mm', protocols_tfidf)
    
    whole_tfidf_model = models.TfidfModel(whole_bow)
    whole_tfidf = whole_tfidf_model[whole_bow]
    corpora.MmCorpus.serialize('nlp14_data/bow_and_tfidf/whole_tfidf.mm', whole_tfidf)
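
The BoW and Tf-Idf corpora above are sparse: each document is a list of (term_id, weight) pairs. PCA and tSNE in sklearn expect a documents x terms matrix, so a conversion step is needed before the reduction. The sketch below shows one way to do it with plain NumPy; the helper name `sparse_docs_to_dense` and the toy data are hypothetical. gensim's `matutils.corpus2dense` performs a similar conversion but returns a terms x documents array, so its result would need to be transposed.

```python
import numpy as np

def sparse_docs_to_dense(docs, num_terms):
    """Convert gensim-style documents, each a list of (term_id, weight)
    pairs, into a dense array of shape (n_docs, num_terms)."""
    dense = np.zeros((len(docs), num_terms), dtype=np.float32)
    for row, doc in enumerate(docs):
        for term_id, weight in doc:
            dense[row, term_id] = weight
    return dense

# Invented example: three documents over a five-term vocabulary
toy_bow = [[(0, 2), (3, 1)], [(1, 1)], [(0, 1), (4, 3)]]
X = sparse_docs_to_dense(toy_bow, num_terms=5)
print(X.shape)
```

With the real corpora, `num_terms` would presumably be `len(my_dictionary)`. A dense matrix of all documents by all terms can be large, so filtering rare terms first may be worth considering.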

<a id='sec3'></a>
## Set up data processing functions
<a href='#sec0'>(Back to top)</a><br>
1. Label encoding function
    - Sort categories by number of appearances
    - Keep top 5 in the ranking, rename all others as 'others'
    - Use LabelEncoder from sklearn to convert them to integers
    - Return a one-dimensional vector with the same length as y
2. PCA transformation function
    - Takes in the whole feature space
    - PCA with n_components = 3
    - Return comp0, comp1, comp2
3. tSNE transformation function
    - Takes in the whole feature space
    - tSNE with n_components = 2
    - Return comp0, comp1
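
The three functions planned above could be sketched roughly as follows. This is a minimal sketch, not the final implementation: the function names, the `top_n`, `other_label`, `perplexity` and `random_state` parameters, and the random demo data are all assumptions. Returning the fitted encoder alongside the integer codes is also an addition, made so that class names can be recovered for a legend.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder

def encode_labels(y, top_n=5, other_label='others'):
    """Keep the top_n most frequent categories, rename the rest to
    other_label, and encode the result as integers."""
    y = pd.Series(y).astype(str)
    top = y.value_counts().index[:top_n]      # sorted by frequency, descending
    y = y.where(y.isin(top), other_label)     # everything else becomes 'others'
    encoder = LabelEncoder()
    codes = encoder.fit_transform(y)          # 1-D int array, same length as y
    return codes, encoder

def pca_transform(X, n_components=3, random_state=0):
    """Project X (n_docs x n_terms) onto its first principal components."""
    comps = PCA(n_components=n_components,
                random_state=random_state).fit_transform(X)
    return tuple(comps[:, i] for i in range(n_components))

def tsne_transform(X, n_components=2, perplexity=30.0, random_state=0):
    """Embed X with tSNE; perplexity must be smaller than the number of rows."""
    comps = TSNE(n_components=n_components, perplexity=perplexity,
                 random_state=random_state).fit_transform(X)
    return tuple(comps[:, i] for i in range(n_components))

# Demo on random data (invented; stands in for a dense BoW or Tf-Idf matrix)
rng = np.random.default_rng(0)
X_demo = rng.random((40, 10))
y_demo = ['a'] * 20 + ['b'] * 12 + ['c'] * 5 + ['d'] * 3

codes, encoder = encode_labels(y_demo, top_n=2)
comp0, comp1, comp2 = pca_transform(X_demo)
t0, t1 = tsne_transform(X_demo, perplexity=5)
print(list(encoder.classes_), comp0.shape, t0.shape)
```

For a high-dimensional Tf-Idf matrix, running PCA first and feeding a few dozen components into tSNE is a common way to keep tSNE tractable, though whether that is needed here depends on the vocabulary size.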

<a id='sec4'></a>
## Set up visualization functions
<a href='#sec0'>(Back to top)</a><br>
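
A possible starting point for the 2D case is sketched below, assuming integer label codes and a matching list of class names (for example from sklearn's LabelEncoder). The function name `scatter_by_label`, its parameters, and the random demo data are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

def scatter_by_label(comp_x, comp_y, codes, class_names, title=''):
    """Scatter two components, drawing one color per label class."""
    comp_x, comp_y, codes = map(np.asarray, (comp_x, comp_y, codes))
    fig, ax = plt.subplots(figsize=(6, 5))
    for code, name in enumerate(class_names):
        mask = codes == code                  # documents in this class
        ax.scatter(comp_x[mask], comp_y[mask], s=10, alpha=0.6, label=name)
    ax.set_xlabel('comp0')
    ax.set_ylabel('comp1')
    ax.set_title(title)
    ax.legend()
    return fig, ax

# Demo on random points (invented data)
rng = np.random.default_rng(0)
demo_x, demo_y = rng.random(30), rng.random(30)
demo_codes = rng.integers(0, 3, size=30)
fig, ax = scatter_by_label(demo_x, demo_y, demo_codes,
                           ['class0', 'class1', 'class2'], title='demo')
```

A 3D version could follow the same pattern with `fig.add_subplot(projection='3d')` and a third component.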

<a id='sec5'></a>
## Visualize in PCA space
<a href='#sec0'>(Back to top)</a><br>

<a id='sec6'></a>
## Visualize in tSNE space
<a href='#sec0'>(Back to top)</a><br>

<a id='sec'></a>
## TEMPLATE
<a href='#sec0'>(Back to top)</a><br>