# Comparing human lung data to mouse tissues
Variations tested:
* Based on protein abundance
* Normalizing all data together
* iBAQ abundance values

Variations to test:
* Based on peptide abundance
* Normalizing mouse and human data separately
* LFQ abundance values

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import Classification_Utils as cu
import MaxQuant_Postprocessing_Functions as mq
import pandas as pd
from sklearn.decomposition import PCA, NMF
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Load mouse data

In [3]:
mouse_protein_file = r'D:\proteinGroups.txt'  # raw string so the backslash is not read as an escape

mouse_protein_df = mq.load_df(mouse_protein_file)
mouse_protein_df = mq.clean_weakly_identified(mouse_protein_df)
mouse_protein_df = mq.remove_dup_proteinIDs(mouse_protein_df)

mouse_iBAQ_df = mq.slice_by_column(mouse_protein_df, 'protein', 'iBAQ ')
mouse_LFQ_df = mq.slice_by_column(mouse_protein_df, 'protein', 'LFQ')

mouse_iBAQ_df.columns = cu.rename_columns(mouse_iBAQ_df, 'Adult', 'Mouse')
mouse_LFQ_df.columns = cu.rename_columns(mouse_LFQ_df, 'Adult', 'Mouse')

mouse_groups = ['Brain', 'Heart', 'Kidney', 'Liver', 'Lung']
mouse_organ_to_columns = {}
mouse_organ_counts = {} 

mouse_iBAQ_df['Majority protein IDs'] = mouse_iBAQ_df['Majority protein IDs'].str[:-6] # strip the 6-character '_MOUSE' suffix
mouse_LFQ_df['Majority protein IDs'] = mouse_LFQ_df['Majority protein IDs'].str[:-6] # strip the 6-character '_MOUSE' suffix
mouse_iBAQ_df.set_index('Majority protein IDs', inplace = True)
mouse_LFQ_df.set_index('Majority protein IDs', inplace = True)

# Load human data

* Human dataset info:
    * Instrument: QExactHF03
    * Separation Type: LC-Waters-Formic_3hr
    * Tool: MSGFPlus_MzMl
    * Jobs: 1498824-1498852
    * Param file: MSGFDB_PartTryp_MetOx_StatCysAlk_10ppmParTol.txt
    * Unlabelled samples

In [4]:
human_lung_protein_file = r'F:\Human_Lung_Raw_Files\LungMAP\combined\txt\human_lung_proteinGroups.txt'
human_groups = ['Human_Lung']

human_lung_df = mq.load_df(human_lung_protein_file)
human_lung_df = mq.clean_weakly_identified(human_lung_df)
human_lung_df = mq.remove_dup_proteinIDs(human_lung_df)
        
human_lung_iBAQ_df = mq.slice_by_column(human_lung_df, 'protein', 'iBAQ ') 
human_lung_LFQ_df = mq.slice_by_column(human_lung_df, 'protein', 'LFQ')
    
human_lung_organ_columns = {}
human_lung_organ_counts = {} 

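human_lung_iBAQ_df['Majority protein IDs'] = human_lung_iBAQ_df['Majority protein IDs'].str[:-6] # strip the 6-character species suffix (presumably '_HUMAN')
human_lung_LFQ_df['Majority protein IDs'] = human_lung_LFQ_df['Majority protein IDs'].str[:-6] # strip the 6-character species suffix (presumably '_HUMAN')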
human_lung_iBAQ_df.set_index('Majority protein IDs', inplace = True)
human_lung_LFQ_df.set_index('Majority protein IDs', inplace = True)

# Load human-mouse correspondence data

In [5]:
mapping_file = r'D:\Human_Mouse_Mapping.txt'
mapping_df = pd.read_csv(mapping_file, usecols=['Matched Term', 'Symbol', 'Species'], sep='\t', lineterminator='\r', encoding = 'latin1')
mapping_df = mapping_df.replace(r'\n','', regex=True)

# Keep rows whose "Species" is missing or contains "Human"
mapping_df = mapping_df[mapping_df['Species'].isnull() | mapping_df['Species'].str.contains('Human')]
mapping_df.set_index('Matched Term', inplace=True)
mapping_df.drop(['Species'], axis=1, inplace=True)

# Series.replace only matches whole values, so use str.replace to strip the substring
mapping_df['Symbol'] = mapping_df['Symbol'].str.replace(' (includes others)', '', regex=False) # remove trailing comments

In [6]:
#########################
#
# Change mouse proteinIDs to common symbol
#
#########################

mouse_proteins = mouse_iBAQ_df.index.values.tolist()
human_proteins = human_lung_iBAQ_df.index.values.tolist()
raw_mappings = mapping_df.to_dict('index') # {mouse protein: {'Symbol': common protein}}
mappings = {}

# Split rows whose key lists several mouse proteins into one entry per protein
for old_key, val in raw_mappings.items():
    for new_key in old_key.split():
        mappings[new_key] = val
        
mouse_iBAQ_df.reset_index(inplace=True)

for protein in mouse_proteins:
    if protein not in human_proteins:
        to_replace = protein + '_MOUSE'
        if to_replace in mappings:
            mapping = mappings[to_replace]
            new_sym = mapping['Symbol']
            mouse_iBAQ_df.replace(protein, new_sym, inplace=True)
        
mouse_iBAQ_df.set_index('Majority protein IDs', inplace=True)
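
The renaming loop above can also be written as a single lookup over the index. This is only a sketch on made-up data: `to_common_symbol`, `human_ids`, and the toy IDs are hypothetical, and `mappings` just mirrors the `{ID: {'Symbol': ...}}` shape built above. One behavioural difference: the loop applies replacements one after another, so a new symbol could itself be replaced later, while a lookup maps every ID exactly once.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration).
mouse_df = pd.DataFrame(
    {"iBAQ Mouse_01_Liver": [1.0, 2.0, 3.0]},
    index=pd.Index(["1433B", "ABCD1", "XYZ9"], name="Majority protein IDs"),
)
human_ids = {"1433B"}                            # IDs already shared with the human data
mappings = {"ABCD1_MOUSE": {"Symbol": "ABCD"}}   # same shape as `mappings` above


def to_common_symbol(protein_id):
    """Return the mapped symbol, or the original ID if no mapping applies."""
    if protein_id in human_ids:
        return protein_id
    entry = mappings.get(protein_id + "_MOUSE")
    return entry["Symbol"] if entry else protein_id


# rename() accepts a function and applies it to every index label.
renamed = mouse_df.rename(index=to_common_symbol)
print(renamed.index.tolist())
```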

In [7]:
print(mouse_iBAQ_df.head())

                      iBAQ Mouse_04_Liver  iBAQ Mouse_05_Liver  \
Majority protein IDs                                             
1433B                          80377000.0          106810000.0   
1433E                         251680000.0          225180000.0   
1433F                          32883000.0           46963000.0   
1433G                         175610000.0          166310000.0   
1433S                          53834000.0           62327000.0   

                      iBAQ Mouse_06_Liver  iBAQ Mouse_07_Brain  \
Majority protein IDs                                             
1433B                         129430000.0         6.599400e+08   
1433E                         266450000.0         1.231800e+09   
1433F                          44594000.0         7.019100e+08   
1433G                         193140000.0         1.754000e+09   
1433S                          93074000.0         5.072200e+08   

[output truncated]

# Combine data 

## Normalize Together 

In [8]:
#########################
#
# Join mouse data to human data
#
#########################

combined_df = mouse_iBAQ_df.join(human_lung_iBAQ_df)

all_organs = ['Mouse.*Brain', 'Mouse.*Heart', 'Mouse.*Kidney', 'Mouse.*Liver', 'Mouse.*Lung', 'Human_Lung']
organs_to_columns = {}
organs_to_observed_counts = {}

combined_df = mq.filter_low_observed(combined_df, all_organs, organs_to_columns, organs_to_observed_counts)
mq.log2_normalize(combined_df)
mq.median_normalize(combined_df)
combined_df = mq.reorder_columns(combined_df, all_organs, organs_to_columns)

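
`mq.log2_normalize` and `mq.median_normalize` are project helpers whose code is not shown here. A minimal sketch of what a log2 transform followed by per-sample median centring could look like, assuming zeros mark missing intensities and that "median normalize" means subtracting each column's median (both are assumptions, as is the name `log2_median_normalize`):

```python
import numpy as np
import pandas as pd


def log2_median_normalize(df):
    """Log2-transform intensities, then centre each sample (column) on its median."""
    # Treat zeros (and negatives) as missing so log2 does not produce -inf.
    logged = np.log2(df.where(df > 0))
    # Subtract each column's median; NaNs are skipped by default.
    return logged - logged.median(axis=0)


toy = pd.DataFrame(
    {"sample_a": [8.0, 16.0, 32.0], "sample_b": [4.0, 0.0, 16.0]},
    index=["P1", "P2", "P3"],
)
normalized = log2_median_normalize(toy)
print(normalized)
```

After this step every sample has a median of zero on the log2 scale, which is what makes the boxplots line up.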


# Plots: Normalized Boxplot, PCA, Pearson matrix

In [9]:
base_dir = 'D:\\Images\\Classifier\\'  # a raw string cannot end in a single backslash
combined_dir = base_dir + 'Human_Lung_Mouse_Tissues_'
combined_color_mapping = mq.map_colors(all_organs, organs_to_columns)

mq.make_seaborn_boxplot(combined_df, combined_dir, 'Median Normalized Boxplot', combined_color_mapping)

combined_df = mq.impute_missing(combined_df)

all_columns = combined_df.columns.values.tolist()

In [10]:
combined_pca, combined_pca_data = mq.do_pca(combined_df, 'protein')

combined_per_var, combined_labels = mq.make_scree_plot(combined_pca, combined_dir)
mq.draw_pca_graph2(all_columns, combined_pca_data, combined_dir, combined_color_mapping, combined_per_var, combined_labels, all_organs, organs_to_columns)
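
`mq.do_pca` and `mq.make_scree_plot` are not shown. The sketch below assumes they wrap scikit-learn's `PCA` with samples as rows and report the percentage of variance explained by each component. The data is random and all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic proteins x samples table standing in for combined_df.
toy = pd.DataFrame(rng.normal(size=(50, 6)),
                   columns=[f"sample_{i}" for i in range(6)])

# PCA expects samples as rows, so transpose the proteins x samples table.
samples = toy.T
pca = PCA(n_components=4)
coords = pca.fit_transform(samples)

# Percentage of variance explained per component, as shown in a scree plot.
per_var = np.round(pca.explained_variance_ratio_ * 100, 1)
labels = [f"PC{i + 1}" for i in range(len(per_var))]
print(dict(zip(labels, per_var)))
```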


In [None]:
mq.make_pearson_matrix(combined_df, combined_dir, dimensions=(20,15))
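
`mq.make_pearson_matrix` presumably plots pairwise correlations between samples. The underlying matrix can be computed with `DataFrame.corr`; this sketch uses synthetic columns with made-up names and does no plotting.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=100)
toy = pd.DataFrame({
    # Two samples sharing a common signal, plus one unrelated sample.
    "Mouse_01_Lung": base + rng.normal(scale=0.1, size=100),
    "Mouse_02_Lung": base + rng.normal(scale=0.1, size=100),
    "Mouse_01_Liver": rng.normal(size=100),
})

# Pairwise Pearson correlation between samples (columns).
pearson = toy.corr(method="pearson")
print(pearson.round(2))
```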

# Classifiers 

In [12]:
#########################
#
# Split off mouse data for training and human data for testing
#
#########################

human_lung_cols = human_lung_iBAQ_df.columns.values.tolist()
mouse_cols = mouse_iBAQ_df.columns.values.tolist()

mouse_data = combined_df[mouse_cols].T
human_lung_data = combined_df[human_lung_cols].T

In [13]:
mouse_organs_to_columns = {k:v for (k,v) in organs_to_columns.items() if 'Mouse' in k}
human_organs_to_columns = {k:v for (k,v) in organs_to_columns.items() if 'Human' in k}

In [14]:
#########################
#
# Get mouse (training) labels and human (test) labels
#
#########################

mouse_labels = cu.get_labels(mouse_cols, mouse_organs_to_columns)
mouse_labels = [label.replace('Mouse.*', '') for label in mouse_labels]

human_lung_labels = cu.get_labels(human_lung_cols, human_organs_to_columns)
human_lung_labels = [label.replace('Human_', '') for label in human_lung_labels]

## Decision Tree

In [15]:
dt = cu.decisiontree_model_crossval(mouse_data, mouse_labels, 4)

Scores: [ 0.8  0.9  1.   1. ]
Accuracy: 0.93 (+/- 0.17)
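
The `Scores` / `Accuracy` output matches the common scikit-learn idiom of printing the mean fold score with a spread of two standard deviations. Assuming `cu.decisiontree_model_crossval` does something similar (its code is not shown), the reported line can be reproduced from the four fold scores. The Iris data is only a stand-in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Fold scores reported above for the decision tree.
reported = np.array([0.8, 0.9, 1.0, 1.0])
mean, spread = reported.mean(), reported.std() * 2
print("mean = %.3f, 2 * std = %.3f" % (mean, spread))

# The same calculation on a stand-in dataset (Iris is not part of this analysis).
X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=4)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

With only four folds the spread is a rough indication of variability, not a confidence interval.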


In [18]:
dt_pred = cu.make_test_prediction(dt, human_lung_data, human_lung_labels)

print("\n")
cu.show_prediction_probabilities(dt, human_lung_data, 0)

score 0.137931034483
pred ['Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Lung'
 'Lung' 'Liver' 'Lung' 'Liver' 'Liver' 'Liver' 'Liver' 'Lung' 'Liver'
 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver' 'Liver'
 'Liver' 'Liver']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']


Prediction probabilities for sample:
Brain : 0.0
Heart : 0.0
Kidney : 0.0
Liver : 1.0
Lung : 0.0
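
The `score` printed by `cu.make_test_prediction` is consistent with plain accuracy: 4 of the 29 human samples were predicted as 'Lung', and 4 / 29 ≈ 0.138. A quick check, assuming the helper reports accuracy (its code is not shown):

```python
from sklearn.metrics import accuracy_score

# Predictions copied from the decision tree output above, in order.
pred = (["Liver"] * 8 + ["Lung"] * 2 + ["Liver"] + ["Lung"]
        + ["Liver"] * 4 + ["Lung"] + ["Liver"] * 12)
actual = ["Lung"] * 29

# Fraction of samples whose predicted label equals the true label.
score = accuracy_score(actual, pred)
print("score", score)
```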


## Random Forest

In [19]:
rf = cu.randomforest_model_crossval(mouse_data, mouse_labels, 4)

Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)


In [20]:
rf_pred = cu.make_test_prediction(rf, human_lung_data, human_lung_labels)

print("\n")
cu.show_prediction_probabilities(rf, human_lung_data, 0)

score 0.0344827586207
pred ['Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain'
 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain' 'Brain'
 'Brain' 'Brain' 'Brain' 'Liver' 'Brain' 'Brain' 'Brain' 'Brain' 'Lung'
 'Brain' 'Brain']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']


Prediction probabilities for sample:
Brain : 0.4
Heart : 0.0
Kidney : 0.1
Liver : 0.1
Lung : 0.4


## KNN

In [19]:
knn = cu.knn_model_crossval(mouse_data, mouse_labels, 4)

Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)


In [20]:
knn_pred = cu.make_test_prediction(knn, human_lung_data, human_lung_labels)

print("\n")
cu.show_prediction_probabilities(knn, human_lung_data, 4)

score 1.0
pred ['Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']


Prediction probabilities for sample:
Brain : 0.0
Heart : 0.0
Kidney : 0.0
Liver : 0.0
Lung : 1.0


## Naive Bayes

In [21]:
gnb = cu.bayes_gaussian_model_crossval(mouse_data, mouse_labels, 4)

Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)


In [22]:
gnb_pred = cu.make_test_prediction(gnb, human_lung_data, human_lung_labels)

print("\n")
cu.show_prediction_probabilities(gnb, human_lung_data, 0)

score 1.0
pred ['Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']


Prediction probabilities for sample:
Brain : 0.0
Heart : 0.0
Kidney : 1.41204033987e-294
Liver : 0.0
Lung : 1.0


## SVC variations

In [23]:
models = cu.SVC_models_crossval(mouse_data, mouse_labels, 4)

Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)
Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)
Scores: [ 0.2  0.2  0.2  0.2]
Accuracy: 0.20 (+/- 0.00)
Scores: [ 1.  1.  1.  1.]
Accuracy: 1.00 (+/- 0.00)


In [24]:
svc_pred = cu.make_test_prediction(models[0], human_lung_data, human_lung_labels)

print("\n")
cu.show_prediction_probabilities(models[0], human_lung_data, 0)

score 1.0
pred ['Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']


Prediction probabilities for sample:
Brain : 0.102949129301
Heart : 0.0507749348014
Kidney : 0.169038769535
Liver : 0.0794758217675
Lung : 0.597761344595


# Feature Selection 

* SelectKBest
* SelectPercentile
* Recursive elimination
* SelectFromModel

* Feature selection + Transformation + Classifier --> Pipeline
* Grid Search for best hyperparameters

## SelectKBest, SelectPercentile

In [25]:
from sklearn.feature_selection import SelectKBest, SelectPercentile

print('Original data:', mouse_data.shape)

kbest_data = SelectKBest(k=25).fit_transform(mouse_data, mouse_labels)
print('SelectKBest:', kbest_data.shape)

percentile_data = SelectPercentile(percentile=25).fit_transform(mouse_data, mouse_labels)
print('SelectPercentile:', percentile_data.shape)

Original data: (30, 2218)
SelectKBest: (30, 25)
SelectPercentile: (30, 555)
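
To see which proteins survive the filter rather than just how many, the fitted selector's boolean mask can be applied to the column names. This is a sketch on synthetic data with made-up names such as `protein_0`. `f_classif` (the ANOVA F-test) is the default scoring function of both selectors and is spelled out here only for clarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
labels = np.repeat(["Brain", "Liver", "Lung"], 10)
toy = pd.DataFrame(rng.normal(size=(30, 20)),
                   columns=[f"protein_{i}" for i in range(20)])
# Make one feature strongly organ-specific so it should be selected.
toy["protein_0"] += np.where(labels == "Brain", 10.0, 0.0)

selector = SelectKBest(score_func=f_classif, k=5).fit(toy, labels)
# get_support() returns a boolean mask over the input columns.
selected = toy.columns[selector.get_support()]
print(list(selected))
```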


## Select From Model
* Classifier computes feature importances; features below a threshold (by default the mean importance for tree ensembles) are discarded

In [26]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

etc = ExtraTreesClassifier()
etc = etc.fit(mouse_data, mouse_labels)

model = SelectFromModel(etc, prefit=True)
from_model_data = model.transform(mouse_data)
print('Select From Model:', from_model_data.shape)

Select From Model: (30, 39)
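
For tree ensembles, `SelectFromModel` keeps features whose importance is at or above the mean importance unless another `threshold` is given. A sketch on synthetic data (not the mouse data) that checks this; the estimator settings are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=60, n_features=40, n_informative=5,
                           random_state=0)

# Fit the selector, which in turn fits a clone of the classifier.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0)).fit(X, y)

importances = selector.estimator_.feature_importances_
kept = selector.get_support()

print("Threshold used:", selector.threshold_)
print("Mean importance:", importances.mean())
print("Features kept:", kept.sum(), "of", X.shape[1])
```

Passing `threshold='median'` or a number to `SelectFromModel` changes how many features are kept.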


## Recursive Feature Elimination
* Repeatedly fits the classifier, ranks features by importance, and drops the least important ones, using cross-validation to pick the number of features to keep
* Time-consuming in high-dimensional spaces. Here it keeps all 2218 features, which risks overfitting because there are far fewer samples (30) than features

In [27]:
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=etc, step=1,
              scoring='accuracy')
rfecv.fit(mouse_data, mouse_labels)

print('Total Features:', len(mouse_data.T))
print('Optimal number of features:',  rfecv.n_features_)


Total Features: 2218
Optimal number of features: 2218
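
With `step=1`, RFECV drops a single feature per iteration, which is slow for 2218 features. `step` also accepts a value in (0, 1), in which case that percentage of features is removed per iteration. A sketch on synthetic data; the estimator settings and `cv=4` are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=60, n_features=40, n_informative=5,
                           random_state=0)

rfecv = RFECV(estimator=ExtraTreesClassifier(n_estimators=50, random_state=0),
              step=0.1,            # remove 10% of the features per iteration
              cv=4, scoring="accuracy")
rfecv.fit(X, y)

print("Total features:", X.shape[1])
print("Optimal number of features:", rfecv.n_features_)
```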


## Pipelines
* Chain together feature elimination, reduction, and classification

In [28]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

anova_filter = SelectPercentile(percentile=10)
clf = KNeighborsClassifier()

pca = PCA()
lda = LinearDiscriminantAnalysis(n_components=2)

anova_knn_pipeline = Pipeline([('anova', anova_filter), 
                               ('feature_reduction', lda),
                               ('knn', clf)])

anova_knn_pipeline.fit(mouse_data, mouse_labels)
pipeline_pred = cu.make_test_prediction(anova_knn_pipeline, human_lung_data, human_lung_labels)

score 1.0
pred ['Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']




In [29]:
model_filter = SelectFromModel(etc)
clf = KNeighborsClassifier()

model_knn_pipeline = Pipeline([('model', model_filter), 
                               ('pca', PCA()),
                               ('knn', clf)])

model_knn_pipeline.fit(mouse_data, mouse_labels)
pipeline_pred = cu.make_test_prediction(model_knn_pipeline, human_lung_data, human_lung_labels)

score 1.0
pred ['Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung'
 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung' 'Lung']
actual ['Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung', 'Lung']
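
A pipeline can be passed to `cross_val_score` like any estimator, so feature selection and reduction are refit inside each training fold instead of seeing the held-out samples. A sketch on synthetic data; the step names and parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=60, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

pipeline = Pipeline([("anova", SelectPercentile(percentile=10)),
                     ("pca", PCA(n_components=2)),
                     ("knn", KNeighborsClassifier())])

# Every fold refits the selector and PCA on its own training split only.
scores = cross_val_score(pipeline, X, y, cv=4)
print("Scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```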


## Grid Search to find best parameters for each model

In [30]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.svm import SVC

parameters = {'kernel': ('linear', 'rbf', 'poly'), 
              'C':[1.5, 10, 100, 1000]}
svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(mouse_data, mouse_labels)

clf.best_params_



{'C': 1.5, 'kernel': 'linear'}

### SVC Grid Search

In [15]:
SVC_grid = cu.svc_grid_search(4, 1)

SVC_grid.fit(mouse_data, mouse_labels)

print('Best SVC parameters:\n', SVC_grid.best_params_)
print('\nBest Cross-Validation score:\n', SVC_grid.best_score_)
#print('\nBest Estimator:\n', SVC_grid.best_estimator_)

Best SVC parameters:
 {'classify__C': 1, 'classify__gamma': 0.001, 'classify__kernel': 'linear', 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False), 'reduce_dim__n_components': 2}

Best Cross-Validation score:
 1.0
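
The keys in `best_params_` (`reduce_dim__...`, `classify__...`) suggest that `cu.svc_grid_search` builds a two-step pipeline and searches over both steps. A sketch of such a search, assuming the two arguments are the number of folds and `n_jobs` (a guess, since the helper's code is not shown). The grid values and the Iris data are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True enables predict_proba on the fitted SVC.
pipe = Pipeline([("reduce_dim", PCA()),
                 ("classify", SVC(probability=True))])

# Keys follow the <step name>__<parameter> convention.
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "classify__kernel": ["linear", "rbf"],
    "classify__C": [1, 10],
    "classify__gamma": [0.001, 0.01],
}

grid = GridSearchCV(pipe, param_grid, cv=4, n_jobs=1)
grid.fit(X, y)

print("Best parameters:\n", grid.best_params_)
print("\nBest cross-validation score:\n", grid.best_score_)
```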


In [32]:
cu.show_prediction_probabilities(SVC_grid, human_lung_data, 0)

Prediction probabilities for sample:
Brain : 0.0963748097492
Heart : 0.0351553567414
Kidney : 0.155031518009
Liver : 0.0523517661408
Lung : 0.66108654936


### KNN Grid Search

In [33]:
knn_grid = cu.knn_grid_search(4, 1)

knn_grid.fit(mouse_data, mouse_labels)

print('Best KNN parameters:\n', knn_grid.best_params_)
print('\nBest Cross-Validation score:\n', knn_grid.best_score_)





Best KNN parameters:
 {'classify__n_neighbors': 1, 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False), 'reduce_dim__n_components': 2}

Best Cross-Validation score:
 1.0


In [34]:
cu.show_prediction_probabilities(knn_grid, human_lung_data, 0)

Prediction probabilities for sample:
Brain : 0.0
Heart : 0.0
Kidney : 0.0
Liver : 0.0
Lung : 1.0


### Random Forest Grid Search

In [16]:
rf_grid = cu.rf_grid_search(4, 1)
#rf_grid = cu.rf_grid_search(4, 1, 'f1_micro')

rf_grid.fit(mouse_data, mouse_labels)

print('Best Random Forest parameters:\n', rf_grid.best_params_)
#print('\nBest Cross-Validation score:\n', rf_grid.best_score_)
print('\nBest F1-score:\n', rf_grid.best_score_)

Best Random Forest parameters:
 {'classify__min_samples_split': 2, 'classify__n_estimators': 25, 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False), 'reduce_dim__n_components': 4}

Best F1-score:
 1.0


In [71]:
# Get standard deviation for best model
print('Standard Deviation:', rf_grid.cv_results_['std_test_score'][rf_grid.best_index_])

Standard Deviation: 0.0


In [72]:
cu.show_prediction_probabilities(rf_grid, human_lung_data, 0)

Prediction probabilities for sample:
Brain : 0.44
Heart : 0.08
Kidney : 0.0
Liver : 0.0
Lung : 0.48


# Highly expressed proteins
* Top n proteins contributing to PCA
* Tukey test for each organ's top proteins
* Top proteins by mean abundance per organ

In [38]:
tukeydict = mq.make_tukey_dict(mouse_data.T, mouse_labels)



In [41]:
top_brain_proteins = mq.top_n_enriched(5, 'Brain', tukeydict)
print(list(x[0] for x in top_brain_proteins))

['ES8L2', 'TNPO2', 'SGCD', 'PAXI', 'PERE']


In [42]:
test_dict = cu.get_descending_abundances(mouse_data.T, mouse_labels)

top_abundant_brain_proteins = cu.n_most_abundant(test_dict, 'Brain', 5)
print(top_abundant_brain_proteins)

['TBB4B' 'TBA4A' 'HBA' 'G3P' 'KCRB']
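
"Top proteins by mean abundance per organ" can be sketched with a group-by over samples. `top_by_mean` is a hypothetical helper, not the `cu.n_most_abundant` implementation, and the abundance values are invented.

```python
import numpy as np
import pandas as pd

# Toy proteins x samples table; the values are invented for illustration.
toy = pd.DataFrame(
    {"Mouse_01_Brain": [9.0, 2.0, 5.0],
     "Mouse_02_Brain": [7.0, 4.0, 5.0],
     "Mouse_01_Liver": [1.0, 8.0, 3.0]},
    index=["KCRB", "ALBU", "G3P"],
)
labels = np.array(["Brain", "Brain", "Liver"])   # one label per sample column

# Mean abundance of every protein within each organ (proteins x organs).
organ_means = toy.T.groupby(labels).mean().T


def top_by_mean(means, organ, n):
    """Names of the n proteins with the highest mean abundance in `organ`."""
    return means[organ].nlargest(n).index.tolist()


print(top_by_mean(organ_means, "Brain", 2))
```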
