In [1]:
import pandas as pd
import umap
import numpy as np

## Preprocessing

Load all Cell-Dyn data, i.e. every Cell-Dyn Sapphire record in the UPOD database.

In [3]:
all_celldyn = pd.read_sas("L:/lab_research/RES-Folder-UPOD/Celldynclustering/E_ResearchData/2_ResearchData/celldyn.sas7bdat")

Select all columns with blood values (prefix c_b_) and drop every row in which any of these columns is missing.
The data contain 80 such variables:
``` 
['c_b_wbc', 'c_b_wvf', 'c_b_neu', 'c_b_seg', 'c_b_bnd', 'c_b_ig', 'c_b_lym', 'c_b_lyme', 'c_b_vlym', 'c_b_mon', 'c_b_mone', 'c_b_blst', 'c_b_eos', 'c_b_bas', 'c_b_nrbc', 'c_b_pneu', 'c_b_pseg', 'c_b_pbnd', 'c_b_pig', 'c_b_plym', 'c_b_plyme', 'c_b_pvlym', 'c_b_pmon', 'c_b_pmone', 'c_b_pblst', 'c_b_peos',  'c_b_pbas', 'c_b_pnrbc', 'c_b_rbci', 'c_b_rbco', 'c_b_hgb_usa', 'c_b_mcv', 'c_b_rdw','c_b_pMIC', 'c_b_pMAC', 'c_b_mch_Usa', 'c_b_mchc_usa', 'c_b_ht', 'c_b_plto', 'c_b_plti', 'c_b_mpv', 'c_b_pct', 'c_b_pdw', 'c_b_retc', 'c_b_pretc', 'c_b_irf', 'c_b_pHPO', 'c_b_pHPR', 'c_b_HDW', 'c_b_MCVr', 'c_b_MCHr', 'c_b_MCHCr', 'c_b_prP', 'c_b_namn', 'c_b_nacv',  'c_b_nimn', 'c_b_nicv', 'c_b_npmn', 'c_b_npcv', 'c_b_ndmn', 'c_b_ndcv', 'c_b_nfmn', 'c_b_nfcv', 'c_b_Lamn', 'c_b_Lacv', 'c_b_Limn', 'c_b_Licv', 'c_b_Pimn', 'c_b_Picv', 'c_b_Ppmn',
 'c_b_Ppcv', 'c_b_rbcimn', 'c_b_rbcicv', 'c_b_rbcfmn', 'c_b_rbcfcv', 'c_b_rtcfmn', 'c_b_rtcfcv', 'c_b_hb', 'c_b_mch', 'c_b_mchc'] 
``` 

Future research should establish whether MICE or a similar imputation approach can be used; for now, only complete cases are used. The model is greedy in the sense that no feature-specific preprocessing is done.

In [4]:
X = all_celldyn[[c for c in all_celldyn if c.startswith("c_b_")]]
X = X.dropna().reset_index(drop = True)
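
The complete-case selection above can be illustrated on a toy frame. This is only a sketch: the values and the `studyId` column are made up, and only the `c_b_` prefix convention comes from the data. Counting the dropped rows gives a quick impression of how much data the complete-case rule costs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Cell-Dyn table (hypothetical values and id column).
toy = pd.DataFrame({
    "studyId": [1, 2, 3, 4],
    "c_b_wbc": [6.1, np.nan, 7.4, 5.0],
    "c_b_hb": [8.2, 9.0, np.nan, 7.7],
})

# Keep only the blood-value columns, then only rows without any missing value.
blood_cols = [c for c in toy.columns if c.startswith("c_b_")]
complete = toy[blood_cols].dropna().reset_index(drop=True)

n_dropped = len(toy) - len(complete)
print(f"kept {len(complete)} of {len(toy)} rows, dropped {n_dropped}")
```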

Transform all variables to log-space.

In [5]:
# Element-wise natural log of every column.
# Note: np.log maps 0 to -inf and negative values to NaN.
X = np.log(X)
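
A log transform is only finite for strictly positive values. Whether the real data contain zeros is not stated here, so the following is a sketch on made-up values that shows how to count non-finite results after the transform. The `np.log1p` variant is a possible alternative, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical values; a count such as c_b_blst could plausibly be zero.
toy = pd.DataFrame({"c_b_wbc": [6.1, 7.4, 5.0], "c_b_blst": [0.0, 0.2, 0.0]})

# Silence the divide-by-zero warning that log(0) would raise.
with np.errstate(divide="ignore"):
    logged = np.log(toy)

# Count -inf / NaN entries per column after the transform.
non_finite = (~np.isfinite(logged)).sum()
print(non_finite)

# Alternative (an assumption, not the notebook's choice): log(1 + x) stays finite at 0.
shifted = np.log1p(toy)
```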

Cast float64 to float32 to reduce memory use.

In [6]:
X = X.astype(np.float32)
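
The saving from the cast can be checked directly. The sketch below uses random data with 80 columns as a stand-in for the real table, so only the ratio is meaningful.

```python
import numpy as np
import pandas as pd

# Random stand-in data: 1000 rows, 80 float64 columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((1000, 80)))

# Bytes used by the column data alone (index excluded).
bytes64 = toy.memory_usage(index=False).sum()
bytes32 = toy.astype(np.float32).memory_usage(index=False).sum()
print(bytes64, bytes32)
```

The float32 frame takes half the space, at the cost of roughly 7 significant decimal digits instead of roughly 16.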

In [7]:
X.to_csv("../processed_celldyn.csv")

## Embedding

Load in the preprocessed data (see above)

In [2]:
embedding_data = pd.read_csv("../processed_celldyn.csv")

Initialise the embedder: UMAP with 6 components, 50 neighbours for the manifold approximation, and a minimum distance of 0 between embedded points. The relatively large neighbourhood shifts the focus towards more global structure.

In [3]:
embedder = umap.UMAP(n_components=6, n_neighbors=50, min_dist=0, random_state=42)

In [4]:
len(embedding_data)

1991634

Drop the old index column (Unnamed: 0, written to the file by to_csv), because pd.read_csv gives unexpected errors when this column is assigned via index_col.

In [5]:
embedding_data = embedding_data.drop(['Unnamed: 0'],axis = 1)

Cast to float32 again, because the CSV round trip reads the values back as float64. This could also be the only step where the cast is done.

In [6]:
embedding_data = embedding_data.astype(np.float32)
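
Both clean-up steps above follow from how pandas writes and reads CSV files. The sketch below uses an in-memory buffer and made-up values to show the default behaviour, and one possible way to avoid it (writing with `index=False` and passing `dtype` when reading). The second variant is a suggestion, not what this notebook does.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({"c_b_wbc": [1.5, 2.5], "c_b_hb": [0.5, 0.25]}, dtype=np.float32)

# Default round trip: the index is written as an unnamed first column,
# and floating-point columns come back as float64.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
default_trip = pd.read_csv(buf)
print(list(default_trip.columns), default_trip["c_b_wbc"].dtype)

# Possible alternative: skip the index on write and request float32 on read.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
clean_trip = pd.read_csv(buf, dtype=np.float32)
print(list(clean_trip.columns), clean_trip["c_b_wbc"].dtype)
```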

Fit the embedder

In [7]:
fit = embedder.fit(embedding_data)



In [1]:
import joblib

Export the embedder

In [9]:
joblib.dump(embedder, "../umap_embedding_celldyn_joblib.joblib")

['../umap_embedding_celldyn_joblib.joblib']

In [10]:
embedder

UMAP(dens_frac=0.0, dens_lambda=0.0, min_dist=0, n_components=6, n_neighbors=50,
     random_state=42)

Check that reloading works (it does).

In [2]:
test_embedder = joblib.load("../umap_embedding_celldyn_joblib.joblib")

In [3]:
test_embedder

UMAP(dens_frac=0.0, dens_lambda=0.0, min_dist=0, n_components=6, n_neighbors=50,
     random_state=42)
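
Comparing the printed representation is a reasonable first check. A slightly stricter one is to compare the parameters of the original and reloaded objects programmatically. The sketch below uses the standard-library `pickle` and a `SimpleNamespace` as a hypothetical stand-in for the fitted embedder. The helper name `params_survive_roundtrip` and its `describe` argument are made up for illustration. For a scikit-learn style estimator, `describe` could be `lambda m: m.get_params()`.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace


def params_survive_roundtrip(model, path, describe=vars):
    """Dump `model` to `path`, reload it, and compare describe(model) before and after."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        reloaded = pickle.load(fh)
    return describe(reloaded) == describe(model)


# Hypothetical stand-in carrying the same settings as the embedder above.
stand_in = SimpleNamespace(n_components=6, n_neighbors=50, min_dist=0, random_state=42)

with tempfile.TemporaryDirectory() as tmp:
    ok = params_survive_roundtrip(stand_in, os.path.join(tmp, "embedder.pkl"))
print(ok)
```

This only confirms that the settings survive. It does not confirm that the reloaded object produces identical embeddings.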

In [8]:
import pickle
# Protocol 5 requires Python 3.8+; the context manager closes the file handle.
with open("../umap_embedding_celldyn.pkl", "wb") as f:
    pickle.dump(test_embedder, f, protocol=5)


In [9]:
!jupyter nbconvert --to html create_embedding.ipynb

[NbConvertApp] Converting notebook create_embedding.ipynb to HTML
[NbConvertApp] Writing 577887 bytes to create_embedding.html


In [10]:
pickle.load(open("../umap_embedding_celldyn.pkl",'rb'))

UMAP(dens_frac=0.0, dens_lambda=0.0, min_dist=0, n_components=6, n_neighbors=50,
     random_state=42)