# EPFL BIO 322 Project  
### Gene Prediction based on Mouse brain single cell gene expression profiles

### Authors:

- Simon Lee (simon.lee@epfl.ch) 
- Léa Goffinet (lea.goffinet@epfl.ch)

### Project Description

In an experiment on epigenetics and memory, Giulia Santoni (from the lab of Johannes Gräff at
EPFL) measured the gene expression levels in multiple cells of a mouse brain under three different
conditions that we call KAT5, CBP and eGFP. In this challenge, the goal is to predict – as accurately
as possible – for each cell the experimental condition (KAT5, CBP or eGFP) under which it was
measured, given only the gene expression levels.

In [1]:
# import libraries
import concurrent
import glob
import multiprocessing as mp
import os
import sys
from multiprocessing import Pool, Process
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
from tqdm import tqdm

InteractiveShell.ast_node_interactivity = "all"

In [2]:
# read raw data
df = pd.read_csv('../data/train.csv.gz', compression='gzip')
df

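| Unnamed: 0 | Xkr4 | Gm1992 | Gm19938 | Gm37381 | Rp1 | Sox17 | Gm37587 | Gm37323 | Mrpl15 | Lypla1 | ... | AC163611.1 | AC163611.2 | AC140365.1 | AC124606.2 | AC124606.1 | AC133095.2 | AC133095.1 | AC234645.1 | AC149090.1 | labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.190380 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.293686 | CBP |
| 1 | 2.861984 | 0.0 | 0.506726 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.506726 | KAT5 |
| 2 | 2.766762 | 0.0 | 0.629614 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.683511 | eGFP |
| 3 | 2.146434 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.432486 | CBP |
| 4 | 2.840049 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.164235 | eGFP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 2.108581 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.312030 | KAT5 |
| 4996 | 3.108972 | 0.0 | 0.783343 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.518711 | eGFP |
| 4997 | 2.025946 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.025946 | CBP |
| 4998 | 1.227544 | 0.0 | 0.791370 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | KAT5 |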


As expected, single-cell data is extremely sparse. Let's count how many non-zero entries each column has, to get a first impression of which genes might be informative and which might not.

In [3]:
gene_column_headers = df.columns.values.tolist()

In [4]:
# Count, for each column, the number of cells with a non-zero value.
# The write and drop steps below are commented out, so this cell only computes the counts.
with open('counts.txt', 'w') as f:
    for gene in gene_column_headers:
        count = (df[gene] != 0).sum()

        # Uncomment to write the per-gene counts to counts.txt
        # f.write('Counts of gene in cells {} : {} \n'.format(gene, count))

        # Uncomment to regenerate the filtered data (takes 3 hrs to run!)
        # if count < 500:
        #     df = df.drop(columns=[gene])
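
The column-by-column drop above is slow, most likely because every `df.drop` call copies the whole frame. A minimal vectorized sketch of the same idea follows, run on a made-up toy frame rather than the real data. The helper name `filter_sparse_genes` and its parameters are illustrative assumptions, not part of the original notebook, and on the real data the `Unnamed: 0` index column would need to be excluded as well.

```python
import pandas as pd


def filter_sparse_genes(df, min_cells, label_col="labels"):
    """Keep genes that are non-zero in at least `min_cells` cells.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    # Separate the gene columns from the label column.
    genes = df.drop(columns=[label_col])
    # Count non-zero entries for every gene in a single pass.
    nonzero_counts = (genes != 0).sum(axis=0)
    # Same rule as the loop: a gene is dropped when count < min_cells.
    kept = nonzero_counts[nonzero_counts >= min_cells].index
    return df[list(kept) + [label_col]], nonzero_counts


# Toy stand-in for the expression matrix (values are made up).
toy = pd.DataFrame({
    "GeneA": [1.2, 0.0, 0.7, 2.1],
    "GeneB": [0.0, 0.0, 0.0, 0.4],
    "GeneC": [0.0, 1.1, 0.0, 0.9],
    "labels": ["CBP", "KAT5", "eGFP", "CBP"],
})

toy_filtered, toy_counts = filter_sparse_genes(toy, min_cells=2)
print(toy_counts.to_dict())
print(list(toy_filtered.columns))
```

On the toy frame, `GeneB` is non-zero in only one cell and is removed, while `GeneA` and `GeneC` are kept.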

In [5]:
df_filtered = pd.read_csv('../data/filtered_train.csv.gz')

The filtering step removed over 22,000 genes that were expressed (non-zero) in fewer than 10% of the cells. This threshold is a hyperparameter; we believe a smaller threshold is the safer choice, because it is less likely to throw away useful information.
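
To see how sensitive the gene count is to this threshold, one could tabulate the number of genes retained at several candidate cutoffs. The sketch below does this on a made-up toy frame. The helper `genes_kept_by_threshold` and the candidate fractions are assumptions chosen for illustration, not values used in the project.

```python
import pandas as pd


def genes_kept_by_threshold(df, fractions, label_col="labels"):
    """Return {fraction: number of genes non-zero in at least that fraction of cells}.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    genes = df.drop(columns=[label_col])
    # Fraction of cells in which each gene is non-zero.
    detection_rate = (genes != 0).mean(axis=0)
    return {frac: int((detection_rate >= frac).sum()) for frac in fractions}


# Toy stand-in for the expression matrix (values are made up).
toy = pd.DataFrame({
    "GeneA": [1.2, 0.0, 0.7, 2.1],
    "GeneB": [0.0, 0.0, 0.0, 0.4],
    "GeneC": [0.0, 1.1, 0.0, 0.9],
    "labels": ["CBP", "KAT5", "eGFP", "CBP"],
})

kept = genes_kept_by_threshold(toy, fractions=[0.1, 0.5, 0.75])
print(kept)
```

A lower fraction keeps more genes, which matches the intuition that a smaller threshold discards less information at the cost of a wider feature matrix.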

In [6]:
df_filtered

| Unnamed: 0.1 | Unnamed: 0 | Xkr4 | Gm19938 | Mrpl15 | Lypla1 | Tcea1 | Rgs20 | Atp6v1h | Rb1cc1 | Pcmtd1 | ... | mt-Co3 | mt-Nd3 | mt-Nd4 | mt-Nd5 | mt-Cytb | CAAA01118383.1 | Vamp7 | Tmlhe | AC149090.1 | labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2.190380 | 0.000000 | 0.0 | 0.000000 | 1.293686 | 0.000000 | 1.839343 | 1.839343 | 2.190380 | ... | 2.190380 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 1.293686 | CBP |
| 1 | 1 | 2.861984 | 0.506726 | 0.0 | 0.000000 | 0.506726 | 1.458439 | 1.291817 | 1.091771 | 0.841436 | ... | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.506726 | 0.000000 | 0.0 | 0.000000 | 0.506726 | KAT5 |
| 2 | 2 | 2.766762 | 0.629614 | 0.0 | 0.000000 | 1.012971 | 0.629614 | 0.629614 | 0.000000 | 1.012971 | ... | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.629614 | 0.000000 | 0.0 | 0.000000 | 1.683511 | eGFP |
| 3 | 3 | 2.146434 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.060763 | 0.000000 | 1.060763 | ... | 0.664895 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2.432486 | CBP |
| 4 | 4 | 2.840049 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1.132100 | 0.531053 | 0.000000 | 1.649491 | ... | 0.531053 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.531053 | 0.0 | 0.000000 | 2.164235 | eGFP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4995 | 2.108581 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.644255 | 0.644255 | ... | 1.312030 | 0.00000 | 0.000000 | 0.000000 | 1.032877 | 0.000000 | 0.0 | 0.000000 | 1.312030 | KAT5 |
| 4996 | 4996 | 3.108972 | 0.783343 | 0.0 | 0.000000 | 0.000000 | 0.783343 | 0.783343 | 0.466491 | 1.379256 | ... | 0.466491 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.466491 | 1.518711 | eGFP |
| 4997 | 4997 | 2.025946 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.386459 | 0.000000 | ... | 2.650884 | 0.00000 | 2.386459 | 1.456669 | 1.456669 | 0.000000 | 0.0 | 1.456669 | 2.025946 | CBP |
| 4998 | 4998 | 1.227544 | 0.791370 | 0.0 | 0.000000 | 0.791370 | 0.000000 | 0.791370 | 0.791370 | 0.791370 | ... | 2.245478 | 0.79137 | 1.227544 | 0.000000 | 2.108819 | 0.000000 | 0.0 | 0.000000 | 0.000000 | KAT5 |


## Method 1: XGBoost

## Method 2: Sparse-Input Neural Networks