# EPFL BIO 322 Project  
### Gene Prediction based on Mouse brain single cell gene expression profiles

### Authors:

- Simon Lee (simon.lee@epfl.ch) 
- Léa Goffinet (lea.goffinet@epfl.ch)

### Project Description

In an experiment on epigenetics and memory, Giulia Santoni (from the lab of Johannes Gräff at
EPFL) measured the gene expression levels in multiple cells of a mouse brain under three different
conditions that we call KAT5, CBP and eGFP. In this challenge, the goal is to predict – as accurately
as possible – for each cell the experimental condition (KAT5, CBP or eGFP) under which it was
measured, given only the gene expression levels.

In [4]:
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell

from tqdm import tqdm
import requests
import os
import sys
from pathlib import Path
import multiprocessing as mp
from multiprocessing import Process
import concurrent
from multiprocessing import Pool
import xgboost as xgb
import glob

InteractiveShell.ast_node_interactivity = "all"

In [5]:
# read raw data
df = pd.read_csv('../data/train.csv.gz', compression='gzip')
df


As expected, single-cell data is extremely sparse. Let's count how many non-zero entries each column has, to get a first idea of which genes might be important and which might not.
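As a rough illustration of what "sparse" means here, the per-gene non-zero counts and the overall fraction of zeros can be computed in one vectorized pass. This is a minimal sketch on a made-up toy table (the gene names and values are hypothetical, not taken from `train.csv.gz`); the same expressions should apply to `df` once it is loaded.

```python
import pandas as pd

# Toy expression matrix (hypothetical values): rows are cells, columns are genes.
toy = pd.DataFrame(
    {
        "GeneA": [0.0, 1.2, 0.0, 0.0],
        "GeneB": [2.1, 0.0, 0.7, 1.3],
        "GeneC": [0.0, 0.0, 0.0, 0.0],
        "labels": ["CBP", "KAT5", "eGFP", "CBP"],
    }
)

genes = toy.drop(columns="labels")

# Number of cells in which each gene has a non-zero expression value.
nonzero_counts = (genes != 0).sum(axis=0)

# Overall fraction of zero entries in the matrix (its sparsity).
sparsity = float((genes.to_numpy() == 0).mean())

print(nonzero_counts.to_dict())
print(round(sparsity, 3))
```

A gene such as `GeneC`, which is zero in every cell, carries no information for the classifier, which motivates the filtering step.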

In [None]:
gene_column_headers = df.columns.values.tolist()

Run this cell only if you want to filter the genes and write the number of non-zero entries in each column to a text file.

In [None]:
# # this gets the counts of each column and drops the column accordingly
# f = open('../data/counts.txt', 'w')
# for gene in gene_column_headers:
#     count = (df[gene] != 0).sum()
    
#     f.write('Counts of gene in cells {} : {} \n'.format(gene, count))  # Uncomment if you want to generate text file with counts

#     # new data generator: Takes 3 hrs to run!!
#     # if count < 500:
#     #     df = df.drop(columns=[gene])

# # df.to_csv('../data/filtered_train.csv.gz', compression='gzip')

# f.close()

In [6]:
df_filtered = pd.read_csv('../data/filtered_train.csv.gz')

We can see that the filtering step removed over 22,000 genes that were expressed in fewer than 10% of the cells. The threshold is a hyperparameter; we believe a smaller threshold is the safer choice, because it is less likely to throw away useful information.
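The column-by-column loop is noted as taking about three hours. A vectorized filter might do the same job much faster. The sketch below is an assumption about how that could look, not the code that produced `filtered_train.csv.gz`: `filter_genes` and `min_fraction` are hypothetical names, and the toy data is invented. With 5,000 cells, `min_fraction=0.10` would correspond to the `count < 500` cutoff used above.

```python
import pandas as pd

def filter_genes(df, min_fraction=0.10, label_col="labels"):
    """Keep gene columns that are non-zero in at least `min_fraction` of the cells."""
    genes = df.drop(columns=label_col)
    # Fraction of cells in which each gene is expressed (non-zero).
    frac_expressed = (genes != 0).mean(axis=0)
    keep = frac_expressed[frac_expressed >= min_fraction].index
    # Preserve the original column order and keep the label column last.
    return df[list(keep) + [label_col]]

# Toy data (hypothetical): GeneA is expressed in 1 of 4 cells, GeneB in 3, GeneC in 0.
toy = pd.DataFrame(
    {
        "GeneA": [0.0, 1.2, 0.0, 0.0],
        "GeneB": [2.1, 0.0, 0.7, 1.3],
        "GeneC": [0.0, 0.0, 0.0, 0.0],
        "labels": ["CBP", "KAT5", "eGFP", "CBP"],
    }
)

filtered = filter_genes(toy, min_fraction=0.5)
print(filtered.columns.tolist())
```

Lowering `min_fraction` keeps more genes, which is the more conservative choice discussed above.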

In [7]:
df_filtered

Unnamed: 0.1,Unnamed: 0,Xkr4,Gm19938,Mrpl15,Lypla1,Tcea1,Rgs20,Atp6v1h,Rb1cc1,Pcmtd1,...,mt-Co3,mt-Nd3,mt-Nd4,mt-Nd5,mt-Cytb,CAAA01118383.1,Vamp7,Tmlhe,AC149090.1,labels
0,0,2.190380,0.000000,0.0,0.000000,1.293686,0.000000,1.839343,1.839343,2.190380,...,2.190380,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,1.293686,CBP
1,1,2.861984,0.506726,0.0,0.000000,0.506726,1.458439,1.291817,1.091771,0.841436,...,0.000000,0.00000,0.000000,0.000000,0.506726,0.000000,0.0,0.000000,0.506726,KAT5
2,2,2.766762,0.629614,0.0,0.000000,1.012971,0.629614,0.629614,0.000000,1.012971,...,0.000000,0.00000,0.000000,0.000000,0.629614,0.000000,0.0,0.000000,1.683511,eGFP
3,3,2.146434,0.000000,0.0,0.000000,0.000000,0.000000,1.060763,0.000000,1.060763,...,0.664895,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.432486,CBP
4,4,2.840049,0.000000,0.0,0.000000,0.000000,1.132100,0.531053,0.000000,1.649491,...,0.531053,0.00000,0.000000,0.000000,0.000000,0.531053,0.0,0.000000,2.164235,eGFP
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,2.108581,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.644255,0.644255,...,1.312030,0.00000,0.000000,0.000000,1.032877,0.000000,0.0,0.000000,1.312030,KAT5
4996,4996,3.108972,0.783343,0.0,0.000000,0.000000,0.783343,0.783343,0.466491,1.379256,...,0.466491,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.466491,1.518711,eGFP
4997,4997,2.025946,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,2.386459,0.000000,...,2.650884,0.00000,2.386459,1.456669,1.456669,0.000000,0.0,1.456669,2.025946,CBP
4998,4998,1.227544,0.791370,0.0,0.000000,0.791370,0.000000,0.791370,0.791370,0.791370,...,2.245478,0.79137,1.227544,0.000000,2.108819,0.000000,0.0,0.000000,0.000000,KAT5


In [8]:
df_filtered = df_filtered.drop(columns='Unnamed: 0')

Check for null values in case we need to perform imputation.

In [9]:
check_nan = df.isnull().values.any()
print(check_nan)

False


In [10]:
gene_column_headers = df_filtered.columns.values.tolist()

## Method 1: XGBoost

- Once with our filtered data
- Also run with raw data to see if filtering has an effect

In [11]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix

In [12]:
# split into train and validation set using sklearn
gene_column_headers_filtered = gene_column_headers[:-1]
y = df_filtered['labels']
X = df_filtered.iloc[:,:-1]

# LabelEncoder assigns integer codes in sorted order of the class names: CBP -> 0, KAT5 -> 1, eGFP -> 2
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [13]:
# perform a typical split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify = y)

training_data = {'X_train':X_train,'y_train':y_train,
                'X_val': X_val,'y_val':y_val}
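Two details of the encoding and split cells are easy to miss: `LabelEncoder` orders classes by sorted name, and `stratify` keeps class proportions equal in both subsets. The sketch below checks both on invented labels (10 cells per condition, an assumption made only for illustration). The `random_state` argument is an addition that the notebook does not use; it makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: 10 cells per condition.
labels = np.array(["KAT5"] * 10 + ["CBP"] * 10 + ["eGFP"] * 10)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order; uppercase letters sort before lowercase.
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)  # {'CBP': 0, 'KAT5': 1, 'eGFP': 2}

# random_state is an addition for repeatability, not part of the notebook.
X_tr, X_va, y_tr, y_va = train_test_split(
    features, codes, test_size=0.2, stratify=codes, random_state=0
)

# With stratify, every class keeps the same share in both subsets.
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print(train_counts, val_counts)  # [8 8 8] [2 2 2]
```

Because of the sorted order, any plot or report that shows class names should take them from `encoder.classes_` rather than from a hand-written list.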

In [14]:
X_train.shape

(4000, 9137)

In [15]:
y_train.shape

(4000,)

In [16]:
# Fit the sklearn-style XGBoost classifier, then train again with the native xgboost API
# to record per-round metrics, and plot the resulting learning curves.

def fit(model, training_data=training_data, epochs=300, label_gene=gene_column_headers_filtered):
    # fit the sklearn wrapper
    print('Fitting model...')
    model.fit(training_data['X_train'], training_data['y_train'])
    print('Fitting done!')

    # train with the native xgboost API to track metrics on the train and validation sets
    train = xgb.DMatrix(training_data['X_train'], label=training_data['y_train'])
    val = xgb.DMatrix(training_data['X_val'], label=training_data['y_val'])
    params = model.get_xgb_params()
    metrics = ['mlogloss', 'merror']
    params['eval_metric'] = metrics
    evaluation = {}
    evallist = [(val, 'val'), (train, 'train')]
    xgb_model = xgb.train(params, train, epochs, evallist, evals_result=evaluation, verbose_eval=100)

    # model report on the held-out validation set (arguments are y_true, y_pred)
    val_pred = model.predict(training_data['X_val'])
    print('-- Model Report --')
    print('XGBoost Accuracy: ' + str(accuracy_score(training_data['y_val'], val_pred)))
    print('XGBoost F1-Score: ' + str(f1_score(training_data['y_val'], val_pred, average='micro')))

    # plot the learning curve of each evaluation metric
    for m in metrics:
        val_score = evaluation['val'][m]
        train_score = evaluation['train'][m]
        x = range(0, epochs)
        plt.rcParams["figure.figsize"] = [6, 6]
        plt.plot(x, val_score, label="Validation")
        plt.plot(x, train_score, label="Train")

        title_name = m + " plot"
        plt.title(title_name)
        plt.xlabel('Epoch')
        plt.ylabel(m)
        lgd = plt.legend()
        plt.show()

    # make sure the two lengths match so we can plot the feature importance
    print("length of features list: {}".format(len(label_gene)))
    print("length of feature importance vector {}".format(len(model.feature_importances_)))

    return xgb_model
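The two curves plotted by `fit` use the metrics `mlogloss` (multiclass log loss) and `merror` (fraction of misclassified samples). To make the plots easier to read, here is a small NumPy sketch of both quantities on invented probabilities. The helper names and the clipping constant `eps` are assumptions for illustration; XGBoost's internal implementation may differ in numerical details.

```python
import numpy as np

def mlogloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)  # avoid log(0)
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

def merror(y_true, proba):
    """Fraction of samples whose most probable class is not the true class."""
    return float(np.mean(np.argmax(proba, axis=1) != y_true))

# Hypothetical predictions for four cells and three classes.
y_true = np.array([0, 1, 2, 1])
proba = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.3, 0.4, 0.3],  # true class 2, but class 1 has the highest probability
        [0.2, 0.5, 0.3],
    ]
)

loss = mlogloss(y_true, proba)
error = merror(y_true, proba)
print(loss, error)
```

A validation `mlogloss` that rises while the training `mlogloss` keeps falling is the usual sign of overfitting, which is why both curves are drawn on the same axes.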

In [None]:
# hyperparameters that can be adjusted in the final model
xgb_model = XGBClassifier(learning_rate=0.1, # play around with learning rate
                    n_estimators=300, # play around with number of boosted trees built
                    max_depth=9, # play around with tree depth
                    objective='multi:softmax',  # I saw that using softmax or softprob is best for multi class classification
                    nthread=4,
                    num_class=3,
                    seed=1 # seed is included for reproducibility
                    )

xgb_trained = fit(xgb_model, training_data)

Fitting model...


Read in test.csv.gz to assess the model and generate predictions.

In [None]:
test = pd.read_csv('../data/test.csv.gz', compression='gzip', usecols=gene_column_headers_filtered)
test

In [None]:
# get it into dmatrix again
X = test
testing_data = xgb.DMatrix(data=X)

In [None]:
# predict on the validation split; multi:softmax returns class indices as floats
dval = xgb.DMatrix(training_data['X_val'], label=training_data['y_val'])
pred = xgb_trained.predict(dval).astype(int)

In [None]:
print(classification_report(training_data['y_val'], pred))

In [None]:
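# store the result under a new name so the sklearn function is not shadowed
cm = confusion_matrix(training_data['y_val'], pred)

In [None]:
def plot_confusion_matrix(cm, classes, normalized=True, cmap='bone'):
    # colour the cells by row-normalised rates and annotate them with the raw counts
    norm_cm = cm
    if normalized:
        norm_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(norm_cm, annot=cm, fmt='g', xticklabels=classes, yticklabels=classes, cmap=cmap)

# confusion matrix; label_encoder.classes_ lists the class names in encoded order (CBP, KAT5, eGFP)
plot_confusion_matrix(cm, list(label_encoder.classes_))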
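The row normalization used for the heatmap colours can be checked by hand. The following sketch builds a confusion matrix from invented predictions with plain NumPy and divides each row by its total, so that the diagonal holds the per-class recall. The helper names are hypothetical, and the zero-row guard is an addition that the plotting function above does not include.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def row_normalize(cm):
    """Divide each row by its sum; rows without samples stay at zero."""
    row_sums = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm, row_sums, out=out, where=row_sums != 0)

# Hypothetical encoded labels (0 = CBP, 1 = KAT5, 2 = eGFP).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

counts = confusion_counts(y_true, y_pred, n_classes=3)
rates = row_normalize(counts)
print(counts)
print(rates)
print(np.diag(rates))  # per-class recall
```

Normalizing by row makes classes of different sizes comparable, while the raw counts remain useful as annotations.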

In [None]:
import graphviz
plt.rcParams["figure.figsize"] = [22,40]
xgb.plot_tree(xgb_trained)

In [None]:
plt.rcParams["figure.figsize"] = [22,22]
xgb.plot_importance(xgb_trained)


As expected, many features are not very important, but let's still run our classifier on the raw data.

Raw data: assess performance

In [None]:
# split into train and validation set using sklearn
y = df['labels']
X = df.loc[:, df.columns != 'labels'] 

# LabelEncoder assigns integer codes in sorted order of the class names: CBP -> 0, KAT5 -> 1, eGFP -> 2
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [None]:
# perform a typical split of the raw data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify = y)

data = {'X_train':X_train,'y_train':y_train,
        'X_val': X_val,'y_val':y_val}

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
# hyperparameters that can be adjusted in the final model
xgb_model = XGBClassifier(learning_rate=0.1, # play around with learning rate
                    n_estimators=300, # play around with number of boosted trees built
                    max_depth=9, # play around with tree depth
                    objective='multi:softmax',  # I saw that using softmax or softprob is best for multi class classification
                    nthread=4,
                    num_class=3,
                    seed=1 # seed is included for reproducibility
                    )

# train on the raw-data split rather than the filtered one
xgb_trained = fit(xgb_model, data, label_gene=X_train.columns.tolist())

Next step: train additional models with different hyperparameters.

In [None]:
# predict on the raw-data validation split; multi:softmax returns class indices as floats
dval = xgb.DMatrix(data['X_val'], label=data['y_val'])
pred = xgb_trained.predict(dval).astype(int)

In [None]:
print(classification_report(data['y_val'], pred))

In [None]:
# store the result under a new name so the sklearn function is not shadowed
cm_raw = confusion_matrix(data['y_val'], pred)

In [None]:
plot_confusion_matrix(cm_raw, list(label_encoder.classes_))
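The model cells mark `learning_rate`, `n_estimators` and `max_depth` as values to "play around with". One way to organize that search is to enumerate every combination up front. The sketch below does this with the standard library only; the candidate values are hypothetical and `expand_grid` is an illustrative helper, not part of the notebook.

```python
from itertools import product

# Hypothetical candidate values; only the parameter names come from the notebook.
grid = {
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
    "max_depth": [3, 6, 9],
}

def expand_grid(grid):
    """Return one dict per combination of the listed hyperparameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(grid)
print(len(candidates))  # 18
print(candidates[0])
```

Each dictionary could then be unpacked into the classifier constructor (for example `XGBClassifier(**params, objective='multi:softmax')`) and scored on the validation split. The `GridSearchCV` class imported earlier offers the same enumeration combined with cross-validation.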

## Method 2: Sparse-Input Neural Networks

- Once with our filtered data
- Also run with raw data to see if filtering has an effect