# Using unlabelled, unfractionated datasets obtained from QExact and VOrbi instruments
* Datasets were searched against the H_sapiens_Uniprot_SPROT_2017-04-12 and Tryp_Pig_Bov sequence files using MSGFPlus
* Results were combined with MASIC to obtain quantitation data

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import Classification_Utils as cu
import MaxQuant_Postprocessing_Functions as mq
from os import listdir
import pandas as pd

## Load (and combine?) data from all tissues

In [3]:
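data_dir = r'F:\Human_peptide_data'
# listdir returns bare file names; join them with the directory (assumes cu.combine_csvs expects full paths)
file_paths = [os.path.join(data_dir, name) for name in listdir(data_dir)]

df = cu.combine_csvs(file_paths)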

In [4]:
df.shape

(0, 0)

## Clean data
* Log2 transform
* Mean/Median normalize
* Impute missing values

In [5]:
mq.log2_normalize(df)

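# Standardize each column (z-score): subtract the mean, divide by the standard deviation
df = (df - df.mean())/df.std()

# Impute missing values with half of the global minimum.
# Note: after z-scoring the minimum is typically negative, so min/2 lies above it, closer to the mean
df_min = df.min().min()
impute_val = df_min/2
df = df.fillna(impute_val)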
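
As a point of comparison, here is a minimal pandas-only sketch of the three cleaning steps on a toy intensity table. It does not use `mq.log2_normalize`, whose behaviour is not shown in this notebook. The toy data, the `clean` helper, the treatment of zeros as missing, and the choice to impute before standardizing are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def clean(raw):
    """Hypothetical helper: log2-transform, impute with half the minimum, then z-score."""
    # Treat zeros as missing so that log2 does not produce -inf
    logged = np.log2(raw.replace(0, np.nan))
    # Impute while values are still on the log2 scale, so that min/2 lies
    # below every observed value (assumes the minimum log2 intensity is positive)
    impute_val = logged.min().min() / 2
    filled = logged.fillna(impute_val)
    # Standardize each column (sample) to zero mean and unit standard deviation
    return (filled - filled.mean()) / filled.std()

# Toy data: rows are peptides, columns are samples
toy = pd.DataFrame(
    {"Liver_01": [1024.0, 256.0, 0.0], "Liver_02": [512.0, np.nan, 64.0]},
    index=["PEPTIDE_A", "PEPTIDE_B", "PEPTIDE_C"],
)
cleaned = clean(toy)
```

Imputing before standardizing keeps the imputed value below every observed log2 intensity, which is usually the intent of half-minimum imputation.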

## Map each column to a corresponding label

In [6]:
tissues = ['Blood_Plasma', 'Blood_Serum', 'Liver', 'Monocyte', 'Ovary', 'Pancreas', 'Substantia_Nigra', 'Temporal_Lobe']
            
tissues_to_columns = cu.map_tissues_to_columns(df, tissues)
tissues_to_columns

{'Blood_Plasma': [],
 'Blood_Serum': [],
 'Liver': [],
 'Monocyte': [],
 'Ovary': [],
 'Pancreas': [],
 'Substantia_Nigra': [],
 'Temporal_Lobe': []}

In [7]:
labels = cu.get_labels(df, tissues, tissues_to_columns)

# Sort columns by tissue type for visualization purposes

StopIteration: 
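
The helper functions in `Classification_Utils` are not shown here. As a rough sketch of the idea, assuming each column name starts with its tissue name (an assumption about the naming scheme), the mapping and the per-column labels could be built as follows. `map_by_prefix` and `labels_for` are hypothetical names, not the `cu` API.

```python
def map_by_prefix(columns, tissues):
    """Map each tissue to the columns whose names start with that tissue name."""
    return {t: [c for c in columns if c.startswith(t)] for t in tissues}

def labels_for(columns, tissue_to_cols):
    """Return one tissue label per column, in column order."""
    col_to_tissue = {c: t for t, cols in tissue_to_cols.items() for c in cols}
    unmatched = [c for c in columns if c not in col_to_tissue]
    if unmatched:
        # Fail with an explicit message instead of an opaque iterator error
        raise ValueError(f"No tissue found for columns: {unmatched}")
    return [col_to_tissue[c] for c in columns]

# Toy column names following the assumed "<Tissue>_<replicate>" scheme
columns = ["Blood_Plasma_01", "Blood_Serum_01", "Liver_01", "Liver_02"]
tissues = ["Blood_Plasma", "Blood_Serum", "Liver"]

tissue_to_cols = map_by_prefix(columns, tissues)
labels = labels_for(columns, tissue_to_cols)
```

An explicit check for unmatched columns makes an empty or incomplete mapping easier to diagnose before the labels are used downstream.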

## Visualize data
* Normalized boxplots
* Scree plot
* PCA plot
* Pearson correlation matrix

In [8]:
image_dir = 'D:\\Images\\Human_Tissues\\'  # ends in a single separator so file names can be appended directly

column_to_color = mq.map_colors(tissues, tissues_to_columns)
column_to_color

{}
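
The plots listed in this section are not produced above. As a hedged sketch, the quantities behind the scree plot, the PCA plot, and the Pearson matrix can be computed with NumPy and pandas alone; the synthetic data and column names below are assumptions, and the results can be passed to any plotting library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 peptides (rows) x 6 samples (columns)
sample_names = [f"Liver_{i}" for i in range(3)] + [f"Ovary_{i}" for i in range(3)]
toy = pd.DataFrame(rng.normal(size=(50, 6)), columns=sample_names)

# PCA treats samples as observations, so transpose and center each peptide
X = toy.T.to_numpy()
X_centered = X - X.mean(axis=0)

# PCA via SVD: squared singular values are proportional to component variances
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)  # bar heights for the scree plot
pc_scores = U * S                               # sample coordinates for the PCA plot

# Pearson correlation between every pair of samples (columns)
pearson = toy.corr(method="pearson")
```

The first two columns of `pc_scores` give the coordinates for a two-dimensional PCA scatter plot, with one point per sample.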

## Test various classifiers using cross-validation

### Decision Tree

### KNN

### Logistic Regression

### Naive Bayes
* Gaussian
* Multinomial

### SVC variations

### Aggregations
* Decision Tree
* Gradient Boosting
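
The classifier families listed in this section can be compared with 5-fold cross-validation in scikit-learn. The sketch below uses synthetic data from `make_classification` as a stand-in for the peptide matrix. Reading "Aggregations: Decision Tree" as a random forest is an assumption, and multinomial naive Bayes is left out because it requires non-negative features, which standardized data does not provide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: rows are samples, columns are peptide features
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),
    "SVC (linear)": SVC(kernel="linear"),
    "SVC (RBF)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 stratified folds for each classifier
cv_scores = {name: cross_val_score(clf, X, y, cv=5).mean()
             for name, clf in classifiers.items()}
```

Because the notebook's `df` appears to hold one sample per column, it would presumably need to be transposed (`df.T`) before being passed to scikit-learn, which expects one sample per row.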

## Tune parameters of best models (if applicable)
* Check accuracy score and F1 score (the harmonic mean of precision and recall)
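
One way to tune parameters while tracking both metrics is scikit-learn's `GridSearchCV` with multi-metric scoring. This is a sketch on synthetic data; the choice of `SVC`, the parameter grid, and macro-averaged F1 as the selection metric are assumptions rather than choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in: rows are samples, columns are peptide features
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(
    SVC(),
    param_grid,
    scoring={"accuracy": "accuracy", "f1_macro": "f1_macro"},
    refit="f1_macro",  # pick the best parameters by macro-averaged F1
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
# Mean cross-validated accuracy of the same parameter combination
best_accuracy = search.cv_results_["mean_test_accuracy"][search.best_index_]
```

Macro averaging weights every class equally, which may matter when some tissues have far fewer samples than others.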

## Confusion matrices of best models
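
A confusion matrix can be built from out-of-fold predictions so that every sample is predicted by a model that did not see it during training. The sketch below assumes synthetic data and logistic regression as a placeholder for whichever model performs best.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in: rows are samples, columns are peptide features
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Each sample is predicted by the fold model that did not train on it
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred, labels=[0, 1, 2])
```

With real data, the `labels` argument would be the list of tissue names, which also fixes the row and column order of the matrix.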

## Top expressed proteins/peptides per tissue
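
A simple way to rank features per tissue is to average each peptide's abundance over that tissue's columns and keep the largest values. The toy table, the mapping, and the `top_features` helper below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy abundance table: rows are peptides, columns are samples
toy = pd.DataFrame(
    {"Liver_01": [10.0, 3.0, 8.0],
     "Liver_02": [12.0, 5.0, 6.0],
     "Ovary_01": [1.0, 9.0, 7.0]},
    index=["PEPTIDE_A", "PEPTIDE_B", "PEPTIDE_C"],
)
tissue_to_cols = {"Liver": ["Liver_01", "Liver_02"], "Ovary": ["Ovary_01"]}

def top_features(df, tissue_to_cols, n=2):
    """Return the n peptides with the highest mean abundance in each tissue."""
    return {tissue: df[cols].mean(axis=1).nlargest(n).index.tolist()
            for tissue, cols in tissue_to_cols.items()}

top = top_features(toy, tissue_to_cols)
# → {'Liver': ['PEPTIDE_A', 'PEPTIDE_C'], 'Ovary': ['PEPTIDE_B', 'PEPTIDE_C']}
```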

## Save model
* Save array/dataframe of features (via pickle?) along with final model
* Write a script to classify new data: load the saved features and fit the new data to them
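
A minimal sketch of this plan, assuming pickle is the chosen format: the feature list is stored next to the model, and new data is reindexed to the saved features before prediction. The toy data, the file name, and filling absent peptides with `0.0` are assumptions; the fill value in particular should match whatever imputation the training data received.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data: rows are samples, columns are peptide features
rng = np.random.default_rng(0)
features = ["PEPTIDE_A", "PEPTIDE_B", "PEPTIDE_C"]
train = pd.DataFrame(rng.normal(size=(20, 3)), columns=features)
train.iloc[:10] += 3  # shift the first ten samples so the classes separate
y = ["Liver"] * 10 + ["Ovary"] * 10
model = LogisticRegression().fit(train, y)

# Save the model together with the feature list it was trained on
path = os.path.join(tempfile.mkdtemp(), "tissue_model.pkl")
with open(path, "wb") as fh:
    pickle.dump({"model": model, "features": features}, fh)

# Later, in a separate script: load both and align the new data to the features
with open(path, "rb") as fh:
    saved = pickle.load(fh)

new_data = pd.DataFrame({"PEPTIDE_C": [0.1], "PEPTIDE_A": [2.9], "PEPTIDE_D": [1.0]})
# Drop unseen peptides, add missing ones, and restore the training column order
aligned = new_data.reindex(columns=saved["features"], fill_value=0.0)
predictions = saved["model"].predict(aligned)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.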