# Notes
* So far, the methods we've employed for predicting tissue locality (such as MDS and PCA) work by reducing the dimension of the input data, which means the models still start from all ~22,000 input vectors. The question I would like to pursue now is how to select the most useful (i.e. *predictive*) input vectors from the initial data set, so that you only work with those.
    * A practical application of this to our work would be developing a simple, quick test that predicts whether someone has kidney cancer by measuring the expression of only ~10 genes rather than the full ~20,000.
    * This idea comes from Krishnan's *A flexible, interpretable, and accurate...* paper, in which he states "the Library of Integrated Network-Based Cellular Signatures (LINCS)...has shown that measuring 978 'landmark' genes...costing only $5 per sample, is sufficient to then use to impute the expression of all other (tens of thousands of) genes."
* **Three Methods**
    1. Univariate Selection: use sklearn's `SelectKBest` with `f_classif` as the score function
    2. Recursive Feature Elimination
    3. Feature Importance (with bagged decision trees)

    * [helpful article](https://machinelearningmastery.com/feature-selection-machine-learning-python/) (we love Jason Brownlee)
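
To make the three methods concrete, here is a minimal sketch that runs each one on a synthetic data set and keeps 10 features. The data from `make_classification`, the logistic-regression estimator inside `RFE`, and the use of `RandomForestClassifier` as the bagged-tree ensemble are all assumptions made for illustration; none of this is the notebook's real expression data or a settled choice of model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an expression matrix: 120 samples x 200 "genes",
# of which 10 actually carry class information (assumed values).
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
k = 10  # number of features to keep

# 1. Univariate selection: score each feature on its own with an ANOVA F-test.
univariate = SelectKBest(score_func=f_classif, k=k).fit(X, y)
univariate_idx = np.flatnonzero(univariate.get_support())

# 2. Recursive feature elimination: repeatedly fit a model and drop the
#    weakest features (10 per round here) until k remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=k, step=10).fit(X, y)
rfe_idx = np.flatnonzero(rfe.get_support())

# 3. Feature importance: fit a bagged-tree ensemble (a random forest here)
#    and keep the k features with the highest importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
forest_idx = np.sort(np.argsort(forest.feature_importances_)[-k:])

print("univariate:", univariate_idx)
print("RFE:       ", rfe_idx)
print("forest:    ", forest_idx)
```

The three index lists will usually overlap without being identical, since each method ranks features by a different criterion.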
## Remaining Questions
* I wonder what the difference between MDS and PCA is...I sorta get it (PCA finds component vectors; MDS preserves sample-pair distances)
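
One way to see the distinction is a small sketch on random data (the array below is synthetic, not our expression matrix). After fitting, sklearn's `PCA` exposes component vectors in feature space and can project new samples, whereas `MDS` only returns an embedding of the samples it was fitted on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))  # 30 samples x 50 features (synthetic)

# PCA learns directions (component vectors) in the 50-dimensional feature space.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
print(pca.components_.shape)  # (2, 50): one 50-long vector per component

# MDS places the 30 samples in 2-D so that pairwise distances are preserved
# as well as possible; it learns no feature-space directions.
mds = MDS(n_components=2, random_state=0)
X_mds = mds.fit_transform(X)
print(X_mds.shape)  # (30, 2)
print(hasattr(mds, "transform"))  # False: no projection of unseen samples
```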

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
# Gene-expression data is read from a local feather file.
feather_path = input('Enter path to local feather file: ')
# Tissue labels, one row per sample.
tissue_df = pd.read_csv('https://raw.githubusercontent.com/HarritonResearchLab/genomics/main/exploring/first_topics/gene_prediction/tissue_prediction/data/tissue_df.csv')
tissues = np.array(tissue_df['tissue_ordinal'])
# Note: set() has no guaranteed order, so tissue_names is not aligned with the ordinals.
tissue_names = np.array(list(set(tissue_df['tissue_name'])))
feather_df = pd.read_feather(feather_path)

tissue_classes = {'cerebellum': 0,
                  'placenta': 1,
                  'kidney': 2,
                  'endometrium': 3,
                  'liver': 4,
                  'colon': 5,
                  'hippocampus': 6}

plt.rcParams['font.family']='serif'

### Univariate Selection

In [12]:
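from sklearn.feature_selection import SelectKBest, f_classif

# Score every column with a one-way ANOVA F-test against the tissue labels
# and keep the 50 highest-scoring columns.
test = SelectKBest(score_func=f_classif, k=50)
fit = test.fit(feather_df, tissues)

# print(fit.scores_)  # per-column F-scores

# Reduce the input matrix to the 50 selected columns.
features = fit.transform(feather_df)

print(features)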

[[ 6.98523346  5.48902293  5.23089188 ...  5.18146746  5.19194159
   8.51885495]
 [ 6.72916043  5.65780828  5.47521172 ...  5.10367477  5.04317581
   8.57742574]
 [ 6.52877528  5.35406447  5.32371928 ...  5.31028998  4.92563156
   8.89088339]
 ...
 [ 6.38175465 11.32169233  5.59421627 ...  5.06488227  5.20592634
   8.59109236]
 [ 7.09159978 11.21679917  5.18566812 ...  4.90362287  5.47675562
   8.6841739 ]
 [ 6.55500951 11.44477865  5.14289371 ...  5.10781564  5.29746424
   8.75024302]]
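
The printed array contains only values, so it does not say which columns were kept. A minimal sketch of recovering the selected column names and their scores with `get_support()` follows. The `expr` DataFrame, the `gene_*` column names, and the shifts applied to `gene_3` and `gene_7` are hypothetical stand-ins so that the example runs without the feather file.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)  # 3 hypothetical tissue classes, 20 samples each
expr = pd.DataFrame(rng.normal(size=(60, 100)),
                    columns=[f"gene_{i}" for i in range(100)])
# Make two columns depend strongly on the class label (illustrative only).
expr["gene_3"] += labels * 3.0
expr["gene_7"] -= labels * 3.0

selector = SelectKBest(score_func=f_classif, k=5).fit(expr, labels)

# get_support() returns a boolean mask over the input columns.
mask = selector.get_support()
ranked = (pd.DataFrame({"gene": expr.columns[mask],
                        "f_score": selector.scores_[mask],
                        "p_value": selector.pvalues_[mask]})
          .sort_values("f_score", ascending=False)
          .reset_index(drop=True))
print(ranked)
```

Applied to the cell above, the same pattern would be `feather_df.columns[fit.get_support()]`, assuming the columns of `feather_df` are the genes.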
