# Loading data into astir

## 0. Loading necessary libraries

In [21]:
# !pip install -e ../../..
%load_ext autoreload
%autoreload 2
import yaml
import pandas as pd
import astir as ast
import numpy as np



## 1. Starting Astir within Python

The input dataset should contain protein expression measurements for single cells. Each row should hold the expression of every protein in one cell, and each column should hold the expression of one protein across all cells. A marker dictionary that maps features (proteins) to cell types or cell states may also be required. A design matrix is optional; if provided, it should be either an `np.array` or a `pd.DataFrame`.

Initializing `Astir` requires an input dataset `input_expr` of one of three types: `pd.DataFrame`, `Tuple[np.array, List[str], List[str]]`, or `Tuple[SCDataset, SCDataset]`.
- When the input is a `pd.DataFrame`, its rows and columns should represent the cells and the features (proteins), respectively.
- When the input is a `Tuple[np.array, List[str], List[str]]`, the first element (`np.array`) is the expression data, the second element (`List[str]`) holds the column names (the protein names), and the third element (`List[str]`) holds the row names (the cell names). The lengths of the second and third lists should equal the number of columns and rows of the array, respectively.
- When the input is a `Tuple[SCDataset, SCDataset]`, the first `SCDataset` should be the cell type dataset and the second `SCDataset` should be the cell state dataset.
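The `pd.DataFrame` and tuple forms carry the same information. The following is a minimal sketch, using a made-up table of 3 cells and 2 proteins (the values and cell names are illustrative, not taken from the test data), of how one form converts to the other and of the length check described above:

```python
import numpy as np
import pandas as pd

# Toy expression table: rows are cells, columns are proteins (illustrative values).
df_toy = pd.DataFrame(
    [[0.17, 2.34], [0.37, 0.29], [0.18, 2.89]],
    index=["cell_1", "cell_2", "cell_3"],
    columns=["CD20", "Vimentin"],
)

# Tuple form: (expression array, feature names, cell names).
expr = df_toy.to_numpy()
features = list(df_toy.columns)
cells = list(df_toy.index)

# The name lists must match the array's column and row counts.
n_cells, n_features = expr.shape
assert len(features) == n_features
assert len(cells) == n_cells

# Rebuilding the DataFrame from the tuple recovers the original table.
df_back = pd.DataFrame(expr, index=cells, columns=features)
```

Running the length check before handing the tuple to `Astir` catches swapped name lists early.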

The marker dictionary `marker_dict` is not required when `input_expr` is a `Tuple[SCDataset, SCDataset]`. Otherwise, it must be a `Dict[str, Dict[str, str]]`. The outer dictionary may have two keys, `cell_type` and `cell_state`, each mapping to a dictionary that maps the names of cell types or states to protein features. If the user intends to classify only cell types or only cell states, only the corresponding marker dictionary needs to be provided. `marker_dict` is therefore one of `{"cell_state": {...}}`, `{"cell_type": {...}}`, or `{"cell_type": {...}, "cell_state": {...}}`.
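To make the layout described above concrete, here is a small sketch of a marker dictionary and a check on its outer keys. The cell type and state names, the marker lists, and the helper `check_marker_dict` are all made up for illustration and are not part of astir. The sketch also assumes that each name maps to a list of protein names, which may differ from how the values are stored in practice.

```python
# Illustrative marker dictionary; names and marker lists are made up for this
# sketch and are not the contents of the test YAML file.
# Assumption: each cell type/state maps to a list of protein names.
marker_dict_example = {
    "cell_type": {
        "B cells": ["CD20", "CD45"],
        "Stromal": ["Vimentin", "Fibronectin"],
    },
    "cell_state": {
        "EMT": ["Vimentin", "E-Cadherin"],
    },
}

ALLOWED_KEYS = {"cell_type", "cell_state"}


def check_marker_dict(markers):
    """Return the sorted outer keys, raising if the layout is not as described."""
    keys = set(markers)
    if not keys or not keys <= ALLOWED_KEYS:
        raise ValueError(
            f"expected a non-empty subset of {sorted(ALLOWED_KEYS)}, got {sorted(keys)}"
        )
    for key in keys:
        if not isinstance(markers[key], dict):
            raise TypeError(f"markers[{key!r}] must be a dictionary")
    return sorted(keys)


print(check_marker_dict(marker_dict_example))  # ['cell_state', 'cell_type']
```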

Here is an example:

In [19]:
expression_mat_path = "../../../astir/tests/test-data/test_data.csv"
yaml_marker_path = "../../../astir/tests/test-data/jackson-2020-markers.yml"
design_mat_path = "../../../astir/tests/test-data/design.csv"

First, read the marker dictionary from a YAML file:

In [8]:
with open(yaml_marker_path, "r") as stream:
    marker_dict = yaml.safe_load(stream)

Second, read the design matrix from a CSV file:

In [17]:
design_df = pd.read_csv(design_mat_path, index_col=0)

Then, to load the dataset as a `pd.DataFrame`:

In [20]:
df_expr = pd.read_csv(expression_mat_path, index_col=0)
a = ast.Astir(input_expr=df_expr, marker_dict=marker_dict, design=design_df)

Or, to load the dataset as an `np.array`:

In [24]:
np_expr = df_expr.to_numpy()
features = list(df_expr.columns)
cores = list(df_expr.index)
a = ast.Astir(input_expr=(np_expr, features, cores), marker_dict=marker_dict, design=design_df)

Or, to load the dataset as an `SCDataset`:

In [None]:
scd = ast.SCDataset(expr_input=df_expr, marker_dict=marker_dict, design=design_df)
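Whichever input form is used, the proteins named in the marker dictionary need to appear among the dataset's features. A quick sanity check along these lines can be run before constructing the object. The helper `missing_markers` and the toy inputs are hypothetical, written for this sketch, and assume that each cell type or state maps to a list of protein names.

```python
def missing_markers(markers, feature_names):
    """Map each cell type/state to the markers absent from feature_names."""
    available = set(feature_names)
    missing = {}
    for group in markers.values():  # the "cell_type" and/or "cell_state" dictionaries
        for name, proteins in group.items():
            absent = [p for p in proteins if p not in available]
            if absent:
                missing[name] = absent
    return missing


# Toy inputs (illustrative only): "CD8" is deliberately not among the features.
toy_markers = {"cell_type": {"B cells": ["CD20", "CD45"], "T cells": ["CD3", "CD8"]}}
toy_features = ["CD20", "CD3", "CD45", "CD68"]

print(missing_markers(toy_markers, toy_features))  # {'T cells': ['CD8']}
```

With real data, `feature_names` would be something like `df_expr.columns`, and an empty result means every marker was found.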

## 2. Loading from csv and yaml files

## 3. Loading from a directory of csvs and yaml

## 4. Loading from loom

## 5. Loading from anndata

We can read in data from the [AnnData](https://anndata.readthedocs.io/en/stable/anndata.AnnData.html) format, along with a `yaml` file containing marker information using the `from_anndata_yaml` function:

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join('../../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from astir.data import from_anndata_yaml
ast = from_anndata_yaml(anndata_file="../../../astir/tests/test-data/adata_small.h5ad", 
                        marker_yaml="../../../astir/tests/test-data/jackson-2020-markers.yml",
                        protein_name="protein",
                        cell_name="cell_name",
                        batch_name="batch")
print(ast)

Astir object with 6 cell types, 4 cell states, and 10 cells.


Some notes:

1. The protein and cell names are taken from `adata.var[protein_name]` and `adata.obs[cell_name]` respectively if specified, and `adata.var_names` and `adata.obs_names` otherwise.

2. If `batch_name` is specified, the corresponding column of `adata.var` is treated as the batch variable and turned into a design matrix.
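As an illustration of what turning a batch variable into a design matrix can look like, the sketch below one-hot encodes a made-up batch column with pandas. This is one common construction, shown under the assumption that each cell belongs to exactly one batch. It is not necessarily how `from_anndata_yaml` builds the matrix internally.

```python
import pandas as pd

# Hypothetical per-cell batch labels (not from the test data).
batch = pd.Series(
    ["A", "A", "B", "C"],
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
    name="batch",
)

# One-hot encode: one row per cell, one indicator column per batch.
design = pd.get_dummies(batch).astype(float)
print(design.shape)  # (4, 3)
```

Each row of `design` contains a single 1.0 marking the batch that the cell belongs to.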

In [3]:
type(ast.get_type_dataset().get_exprs())

torch.Tensor

In [4]:
ast.get_type_dataset().get_exprs_df()

Unnamed: 0,CD20,CD3,CD45,CD68,Cytokeratin 14,Cytokeratin 19,Cytokeratin 5,Cytokeratin 7,Cytokeratin 8/18,E-Cadherin,Fibronectin,Her2,Vimentin,pan Cytokeratin
ZTMA208_slide_11_By5x8_1,0.168521,0.090277,0.271871,0.412439,0.087354,0.15571,0.100308,0.0,0.096674,0.974271,2.86747,0.552905,2.335253,1.361075
ZTMA208_slide_11_By5x8_2,0.366301,0.352614,0.284034,0.312862,0.152354,0.508728,0.028651,0.029904,0.749755,2.78774,2.174494,1.046198,0.285699,2.454543
ZTMA208_slide_11_By5x8_3,0.177006,0.103808,0.150791,0.122472,0.292241,0.634366,0.090457,0.056627,0.446911,1.92794,2.997043,1.020517,2.887193,2.59046
ZTMA208_slide_11_By5x8_4,0.304068,0.222802,0.219736,0.277622,0.37387,2.212514,0.304824,0.0,1.904837,3.175959,1.598163,2.269974,0.877098,4.250308
ZTMA208_slide_11_By5x8_5,0.137789,0.13001,0.105604,1.03528,0.212105,0.144144,0.074692,0.0,0.0,1.900182,2.326346,0.610897,2.882146,0.275225
ZTMA208_slide_11_By5x8_6,0.182926,0.169596,0.270698,0.257178,0.224863,1.143546,0.1896,0.001542,0.650384,2.580153,1.891692,1.724237,1.931947,2.994441
ZTMA208_slide_11_By5x8_7,0.239257,0.149007,0.351788,0.13808,0.142505,1.415104,0.124484,0.001245,1.091975,2.696699,1.994174,1.796137,0.127125,3.523499
ZTMA208_slide_11_By5x8_8,0.175299,0.153332,0.215698,0.104709,0.237387,2.190369,0.2646,0.0,1.457901,2.788996,1.859896,1.726696,0.106661,4.245234
ZTMA208_slide_11_By5x8_9,0.210541,0.118273,0.146135,0.148164,0.362226,1.267224,0.173477,0.0,0.842407,2.95044,1.852758,2.183716,0.957369,3.098247
ZTMA208_slide_11_By5x8_10,0.308899,0.326121,0.224866,0.276182,0.14024,2.032473,0.334358,0.0,1.503531,2.93859,2.192502,2.312838,1.337983,4.199266
