# **Main Multiome Notebook**

This is the main notebook for the multiome part of the project, where the task is to predict gene expression levels from TF-IDF-normalized chromatin accessibility data.

In this Jupyter notebook, data from several sources are joined together and then used to create predictions for the test dataset. The sources are:
* pre-calculated TruncatedSVD components of the chromatin accessibility data (see the Prepare_SVD_for_multiome notebook);
* source data for three input features, used as is;
* metadata: the donor ID and the day each cell was analyzed (some features are built from this metadata);
* target values for the train set.

The target includes 23000 genes. A Kaggle notebook can hardly fit all the target values in available memory, and fitting an individual model for each of the 23000 targets is not feasible. So I calculate TruncatedSVD components for the target data, predict those components, and then recover the predicted targets by applying the inverse of the TruncatedSVD transform. To further improve results, I build 4 sets of models, each predicting TruncatedSVD components calculated with a different random seed, and then average the predictions.
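The compress, predict, and invert idea can be sketched on synthetic data. This is only an illustration under assumptions: the small random sparse matrix stands in for the real targets, 8 components replace the 64 used below, and the training components are fed straight into `inverse_transform` where the real pipeline would use model predictions.

```python
import numpy as np
import scipy.sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the sparse target matrix (cells x genes).
targets = scipy.sparse.random(200, 50, density=0.2, format="csr", random_state=0)

seeds = [2, 3, 4, 5]
reconstructions = []
for seed in seeds:
    svd = TruncatedSVD(n_components=8, random_state=seed)
    # Compress the targets into a few components per cell.
    components = svd.fit_transform(targets)  # shape (200, 8)
    # In the real pipeline a regressor would predict `components` for test cells;
    # here the training components are reused just to show the inverse step.
    reconstructions.append(svd.inverse_transform(components))  # shape (200, 50)

# Average the gene-level reconstructions obtained with the different seeds.
averaged = np.mean(reconstructions, axis=0)
print(averaged.shape)
```

Averaging happens in gene space, after the inverse transform, because components fitted with different seeds are not guaranteed to be directly comparable column by column.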

In [1]:
# Importing the libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import gc, pickle, scipy.sparse
from sklearn.decomposition import TruncatedSVD
from catboost import CatBoostRegressor
from humanize import naturalsize

In [2]:
# This library is needed to read the *.h5 data
!pip install --quiet tables

In [3]:
DATA_DIR = "/kaggle/input/open-problems-multimodal/"
FP_CELL_METADATA = os.path.join(DATA_DIR,"metadata.csv")

FP_CITE_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_cite_inputs.h5")
FP_CITE_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_cite_targets.h5")
FP_CITE_TEST_INPUTS = os.path.join(DATA_DIR,"test_cite_inputs.h5")
FP_CITE_TEST_INPUTS_FIX = os.path.join(DATA_DIR,"test_cite_inputs_day_2_donor_27678.h5")

FP_MULTIOME_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_multi_inputs.h5")
FP_MULTIOME_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_multi_targets.h5")
FP_MULTIOME_TEST_INPUTS = os.path.join(DATA_DIR,"test_multi_inputs.h5")

FP_SUBMISSION = os.path.join(DATA_DIR,"sample_submission.csv")
FP_EVALUATION_IDS = os.path.join(DATA_DIR,"evaluation_ids.csv")

In [4]:
# Import specially prepared TruncatedSVD data and select rows related to train data.
# Only 128 components will be used, as cross-validation showed other components add little value to the model.
svd_x = pd.read_csv('../input/raw-features-for-multiome/svd.csv', dtype='float32')
svd_x = svd_x.iloc[:105942, :129]
svd_x = svd_x.add_prefix('svd_x_')
del svd_x['svd_x_Unnamed: 0']
gc.collect()

30

In [5]:
# Get column names from target data.
df_target = pd.read_hdf(FP_MULTIOME_TRAIN_TARGETS, start=0, stop=1)
target_names = df_target.columns

del df_target
gc.collect()

0

In [6]:
%%time
# Import prepared sparse matrix of target values.

train_targets = scipy.sparse.load_npz("../input/multimodal-single-cell-as-sparse-matrix/train_multi_targets_values.sparse.npz")

CPU times: user 12.9 s, sys: 912 ms, total: 13.8 s
Wall time: 19.4 s


In [7]:
def save_pca(name, model):
    with open(name, 'wb') as f:
        pickle.dump(model, f)
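The saved models are only useful later if they can be read back. A minimal sketch of a matching loader follows; `load_pca` is a hypothetical helper that is not part of this notebook, and the small random matrix and temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import TruncatedSVD


def save_pca(name, model):
    with open(name, 'wb') as f:
        pickle.dump(model, f)


def load_pca(name):
    # Hypothetical counterpart to save_pca: read a pickled model back from disk.
    with open(name, 'rb') as f:
        return pickle.load(f)


# Toy data standing in for the real target matrix.
X = np.random.default_rng(0).random((30, 10))
svd = TruncatedSVD(n_components=4, random_state=2).fit(X)

path = os.path.join(tempfile.mkdtemp(), 'pca_targets_2.pkl')
save_pca(path, svd)
restored = load_pca(path)

# The restored model can map components back to the original feature space.
recovered = restored.inverse_transform(restored.transform(X))
print(recovered.shape)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that wrote them.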

In [8]:
# For targets, I calculate the TruncatedSVD components in the main notebook and use pickle to save the model, 
# so that it would be possible to perform reverse operation later.
# To achieve better results, I calculate the TruncatedSVD components 4 times and will later fit 4 models and calculate the average.
for i in [2,3,4,5]:
    file_name = 'pca_targets_' + str(i) + '.pkl'
    prefix = 'svd_y_' + str(i) + '_'
    pca_targets = TruncatedSVD(n_components=64, random_state=i)
    #pca_targets = TruncatedSVD(n_components=4, random_state=i)
    t_targets = pca_targets.fit_transform(train_targets)
    save_pca(file_name, pca_targets)
    target_i = pd.DataFrame(t_targets, dtype='float32')
    target_i = target_i.add_prefix(prefix)
    if i == 2:
        target_total = target_i
    else:
        target_total = pd.concat([target_total, target_i], axis=1)

    
del t_targets, train_targets, target_i
gc.collect()
print(target_total.shape)

(105942, 256)


In [9]:
%%time

# Import metadata and select rows related to train set.
md_df = pd.read_csv(FP_CELL_METADATA, index_col='cell_id')
md_df = md_df.loc[md_df['technology'] == "multiome"]
md_df['day'] = md_df['day'].astype('int8')
del md_df['technology']
md_df = md_df.loc[(md_df['donor'] != 27678) & (md_df['day'] != 10)]
print(md_df.shape)
gc.collect()

(105942, 3)
CPU times: user 348 ms, sys: 14 ms, total: 361 ms
Wall time: 501 ms


21

In [10]:
# Import pre-selected important features to be used as is.
df_imp_cols = pd.read_parquet('../input/imp-features-for-multiome/train_corr_features.parquet')
very_imp_cols = ['svd_x_chr1:630875-631689', 'svd_x_chr1:633700-634539', 'svd_x_chr17:22520955-22521852']
df_imp_cols = df_imp_cols[very_imp_cols]
print(df_imp_cols.shape)

(105942, 3)


In [11]:
# Now join all the train data into a single dataframe.
md_df = md_df.merge(df_imp_cols, how = 'left', on = 'cell_id')
df = md_df.reset_index()
df = pd.concat([df, svd_x], axis=1)
df = pd.concat([df, target_total], axis=1)
print(df.shape)

del md_df, svd_x, target_total, df_imp_cols
gc.collect()

(105942, 391)


0

In [12]:
# Check the dataframe size.
size = df.memory_usage(deep=True).sum()
print(naturalsize(size))

178.7 MB


In [13]:
# Now import the prepared TruncatedSVD data for test dataset.
svd_test = pd.read_csv('../input/raw-features-for-multiome/svd.csv', dtype='float32')
svd_test = svd_test.iloc[105942:, :129]
svd_test = svd_test.add_prefix('svd_x_')
del svd_test['svd_x_Unnamed: 0']
svd_test = svd_test.reset_index(drop = True)
print(svd_test.shape)
gc.collect()

(55935, 128)


0

In [14]:
# Import metadata for test dataset.
md_df = pd.read_csv(FP_CELL_METADATA, index_col='cell_id')
md_df = md_df.loc[md_df['technology'] == "multiome"]
md_df['day'] = md_df['day'].astype('int8')
del md_df['technology']
md_df = md_df.loc[(md_df['donor'] == 27678) | (md_df['day'] == 10)]
print(md_df.shape)

(55935, 3)


In [15]:
# Import data for pre-selected important features (test dataset).
df_imp_cols = pd.read_parquet('../input/imp-features-for-multiome/test_corr_features.parquet')
df_imp_cols = df_imp_cols[very_imp_cols]
print(df_imp_cols.shape)

(55935, 3)


In [16]:
# Now join all the test data into a single dataframe.
md_df = md_df.merge(df_imp_cols, how = 'left', on = 'cell_id')
df_test = md_df.reset_index()
df_test = pd.concat([df_test, svd_test], axis=1)
print(df_test.shape)
del md_df, svd_test, df_imp_cols
gc.collect()

(55935, 135)


21

In [17]:
cat_params_submit_fast = {
    "learning_rate" : 0.06,
    "eval_metric" : 'RMSE', 
    "max_depth" : 7,
    "verbose" : 100,
    "n_estimators" : 800,
    "task_type" : 'GPU'
    }
cat_params_submit_middle = {
    "learning_rate" : 0.04,
    "eval_metric" : 'RMSE', 
    "max_depth" : 7,
    "verbose" : 100,
    "n_estimators" : 600,
    "task_type" : 'GPU'
    }
cat_params_submit_slow = {
    "learning_rate" : 0.03,
    "eval_metric" : 'RMSE', 
    "max_depth" : 7,
    "verbose" : 100,
    "n_estimators" : 400,
    "task_type" : 'GPU'
    }

In [18]:
# Function to create metadata features for both the train and test sets.
# Note: here I cannot use "get_dummies" because one of the donors is only present in test set.
def add_metadata_features(d_frame):
    d_frame['svd_x_donor_13176'] = 0
    d_frame['svd_x_donor_31800'] = 0
    d_frame['svd_x_donor_32606'] = 0
    d_frame.loc[d_frame['donor'] == 13176, 'svd_x_donor_13176'] = 1
    d_frame.loc[d_frame['donor'] == 31800, 'svd_x_donor_31800'] = 1
    d_frame.loc[d_frame['donor'] == 32606, 'svd_x_donor_32606'] = 1
    d_frame['svd_x_day'] = d_frame['day']
    return d_frame
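One possible alternative to the hand-written indicator columns, shown only as a sketch: fixing the category list up front lets `get_dummies` produce the same columns for train and test. The helper name `donor_dummies` and the constant `TRAIN_DONORS` are hypothetical; the donor IDs are the ones used in the function above.

```python
import pandas as pd

# Donors that get an indicator column in add_metadata_features.
TRAIN_DONORS = [13176, 31800, 32606]


def donor_dummies(d_frame, donors=TRAIN_DONORS):
    # Hypothetical helper: values outside `donors` become NaN in the
    # categorical, so those rows get zeros in every indicator column.
    donor = pd.Series(
        pd.Categorical(d_frame['donor'], categories=donors),
        index=d_frame.index,
    )
    return pd.get_dummies(donor, prefix='svd_x_donor', dtype='int8')


# Tiny example frame; donor 27678 has no indicator column of its own.
frame = pd.DataFrame({'donor': [13176, 27678, 32606]})
out = donor_dummies(frame)
print(out)
```

With a fixed category list the column set no longer depends on which donors happen to be present in a given frame.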

In [19]:
# Build CatBoost models and predict the target TruncatedSVD components in a loop.
# Note that I use stronger parameters (more iterations, higher learning rate) for the first TruncatedSVD components.
# For the last components I use fewer iterations and a smaller learning rate to prevent overfitting.
df = add_metadata_features(df)
df_test = add_metadata_features(df_test)
x_cols = [col for col in list(df.columns) if (col.startswith('svd_x_'))]
y_cols = [col for col in list(df.columns) if (col.startswith('svd_y_'))]
X = df[x_cols].values
Y = df[y_cols].values
Xt = df_test[x_cols].values
for i in range(len(y_cols)):
    print('Training_column: ' + str(i))
    num = int(y_cols[i].rsplit('_', 1)[-1])
    #model = lightgbm.LGBMRegressor(**lightgbm_params)
    if num < 16:
        model = CatBoostRegressor(**cat_params_submit_fast)
    elif num < 32:
        model = CatBoostRegressor(**cat_params_submit_middle)
    else:
        model = CatBoostRegressor(**cat_params_submit_slow)
    model.fit(X, Y[:,i].copy())
    col_name = y_cols[i]
    df_test[col_name] = model.predict(Xt)


Training_column: 0
0:	learn: 36.4315814	total: 14.3ms	remaining: 11.4s
100:	learn: 23.3300270	total: 984ms	remaining: 6.81s
200:	learn: 22.2504778	total: 2.2s	remaining: 6.56s
300:	learn: 21.6549551	total: 3.08s	remaining: 5.11s
400:	learn: 21.2566258	total: 3.96s	remaining: 3.94s
500:	learn: 20.9489455	total: 4.84s	remaining: 2.89s
600:	learn: 20.6691061	total: 5.73s	remaining: 1.9s
700:	learn: 20.4284505	total: 6.61s	remaining: 933ms
799:	learn: 20.2037684	total: 7.49s	remaining: 0us
Training_column: 1
0:	learn: 26.8694724	total: 10ms	remaining: 7.99s
100:	learn: 10.4611484	total: 917ms	remaining: 6.34s
200:	learn: 9.8166523	total: 1.81s	remaining: 5.4s
300:	learn: 9.4855449	total: 3.17s	remaining: 5.26s
400:	learn: 9.2760127	total: 4.06s	remaining: 4.04s
500:	learn: 9.1176227	total: 4.96s	remaining: 2.96s
600:	learn: 8.9923715	total: 5.84s	remaining: 1.94s
700:	learn: 8.8728801	total: 6.71s	remaining: 948ms
799:	learn: 8.7714612	total: 7.57s	remaining: 0us
Training_column: 2
0:	lear



Training_column: 97
0:	learn: 3.7778498	total: 9.78ms	remaining: 3.9s
100:	learn: 3.3651731	total: 907ms	remaining: 2.69s
200:	learn: 3.2427383	total: 1.79s	remaining: 1.78s
300:	learn: 3.1807407	total: 2.67s	remaining: 880ms
399:	learn: 3.1420318	total: 3.63s	remaining: 0us
Training_column: 98
0:	learn: 3.7507629	total: 9.6ms	remaining: 3.83s
100:	learn: 3.2396355	total: 890ms	remaining: 2.63s
200:	learn: 3.1270647	total: 1.78s	remaining: 1.76s
300:	learn: 3.0712423	total: 2.78s	remaining: 913ms
399:	learn: 3.0363281	total: 3.93s	remaining: 0us
Training_column: 99
0:	learn: 3.6235259	total: 9.44ms	remaining: 3.77s
100:	learn: 3.2366755	total: 896ms	remaining: 2.65s
200:	learn: 3.1168645	total: 1.78s	remaining: 1.76s
300:	learn: 3.0510581	total: 2.66s	remaining: 875ms
399:	learn: 3.0083187	total: 3.53s	remaining: 0us
Training_column: 100
0:	learn: 3.5758742	total: 9.78ms	remaining: 3.9s
100:	learn: 3.0692435	total: 894ms	remaining: 2.65s
200:	learn: 2.9406774	total: 1.78s	remaining: 1.
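The choice of parameter set in the loop above depends only on the component index at the end of the column name. As a sketch, that rule can be isolated in a small function; `params_for_component` is a hypothetical name, and the strings passed in below are placeholders for the three parameter dictionaries.

```python
def params_for_component(col_name, fast, middle, slow):
    # Column names look like 'svd_y_<seed>_<component>'; take the component index.
    num = int(col_name.rsplit('_', 1)[-1])
    if num < 16:
        return fast    # leading components: most iterations, highest learning rate
    if num < 32:
        return middle
    return slow        # trailing components: fewest iterations, lowest learning rate


# Placeholders stand in for cat_params_submit_fast / _middle / _slow.
tiers = [
    params_for_component(col, 'fast', 'middle', 'slow')
    for col in ['svd_y_2_0', 'svd_y_3_16', 'svd_y_5_63']
]
print(tiers)
```

Keeping the rule in one place makes it easier to try other thresholds without touching the training loop.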

In [20]:
del df, X, Y, Xt, model
# Drop the input feature columns; only the predicted components are needed from here on.
df_test = df_test.drop(columns=x_cols)
gc.collect()

0

In [21]:
# Saving the final results.
df_test[y_cols].reset_index().to_feather('multiome_multi.ftr')