# Prepare your data

This tutorial guides you from an FCS file to the creation of a `Scyan` object. You'll learn how to:
- Create an `adata` object based on an FCS file and preprocess it
- Create the knowledge table required for the annotation

Before continuing, make sure you have already [installed scyan](../../getting_started).

## 1. Creation of an `AnnData` object for your cytometry data

Consider reading the [anndata documentation](https://anndata.readthedocs.io/en/latest/) if you have never heard about `anndata` before (it's a nice library for handling single-cell data).

> NB: Make sure you only keep the population of interest. E.g., if you are interested in immune cells, consider providing only the live cells that are CD45+. If this is not possible, continue the tutorial, but consider running Scyan to filter these cells before annotating the populations.

### Loading a `FCS` file

You probably have `.fcs` files that you want to load. For this, you can use [`scyan.read_fcs`](../../api/io/#scyan.read_fcs).

> NB: if you have another data format, such as `csv`, it also works: you just have to make your own [anndata](https://anndata.readthedocs.io/en/latest/) object. We advise placing everything that is not a marker inside `adata.obs`, i.e. columns such as 'Time' or 'SubjectID'.

In [1]:
import scyan

Global seed set to 0


In [2]:
adata = scyan.read_fcs("<path-to-fcs>.fcs")

print(f"Created anndata object:\n{adata}\nThe marker names are: {', '.join(adata.var_names)}")

In this example, we have $N = 52\,981$ cells and $M = 38$ markers. Some additional data is also stored in `adata.obs`: Time.

Make sure that `adata.var_names` contains only markers, and that every other column is inside `adata.obs`. If this is not the case, consider reading [`scyan.read_fcs`](../../api/io/#scyan.read_fcs) for more advanced usage.

### (Optional) Concatenate your data

If you have multiple `FCS` files, consider [concatenating your data](https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html#anndata.AnnData.concatenate).

We advise adding an observation column such as "batch" or "patient_id" to keep track of the batch / patient ID.

### Preprocess your data

Choose either the `asinh` or `logicle` transformation below, and scale your data.

In [4]:
is_cytof = True

if is_cytof: # we recommend asinh for CyTOF data
    scyan.tools.asinh_transform(adata)
else: # we recommend auto_logicle for flow or spectral flow
    scyan.tools.auto_logicle_transform(adata)

scyan.tools.scale(adata)
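
To build intuition for this step, here is a generic NumPy sketch of an inverse hyperbolic sine transformation followed by a per-marker standardization. It is not Scyan's implementation: the cofactor of 5 and the choice of standardization are assumptions made for the illustration, so refer to the API of `scyan.tools.asinh_transform` and `scyan.tools.scale` for the actual behavior and defaults.

```python
import numpy as np


def asinh_transform(x, cofactor=5.0):
    """Compress large intensities while staying roughly linear near zero."""
    return np.arcsinh(x / cofactor)


def standardize(x):
    """Center each marker (column) to mean 0 and rescale it to unit standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Toy raw intensities: 1000 cells x 3 markers (hypothetical data)
rng = np.random.default_rng(0)
raw = rng.exponential(scale=50.0, size=(1000, 3))

transformed = asinh_transform(raw)
scaled = standardize(transformed)

print(scaled.mean(axis=0).round(3), scaled.std(axis=0).round(3))
```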

## 2. (Optional) Compute a UMAP
You can compute the UMAP coordinates using [`scyan.tools.umap`](../../api/representation/#scyan.tools.umap). Its API documentation describes the available options: in particular, you can compute the UMAP on a specific set of markers, or on a subset of cells to speed up the computation.

> Note that it only computes the coordinates; to display them, use [`scyan.plot.umap`](../../api/plots/#scyan.plot.umap).

In [None]:
scyan.tools.umap(adata)

## 3. Creation of the knowledge table
The knowledge table contains well-known marker expressions per population. For instance, if you want `Scyan` to annotate CD4 T cells, you have to specify which markers CD4 T cells are expected to express or not. Depending on your panel, it may be CD4+, CD8-, CD45+, CD3+, etc. Values inside the table can be:

- `-1` for negative expressions.
- `1` for positive expressions.
- `NA` when you don't know or if it is not applicable (you can also leave the field empty, it will be read as `NA`).
- Some float values such as `0` or `-0.5` for mid and low expressions respectively (use them only when necessary).

We recommend the `csv` format for this table. You can either directly create a `csv`, or use Excel and export the table as `csv`.

You can then import the `csv` to make a pandas `DataFrame`.
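
If you prefer to write the table directly in Python, the sketch below builds a small knowledge table with pandas and checks that it survives a `csv` round trip. The three populations and three markers are a subset chosen for illustration, and the in-memory buffer stands in for a real file path.

```python
import io

import numpy as np
import pandas as pd

# Rows are populations, columns are markers; np.nan means "unknown / not applicable"
table = pd.DataFrame(
    {
        "CD3": [1, 1, -1],
        "CD4": [1, -1, np.nan],
        "CD8": [-1, 1, -1],
    },
    index=pd.Index(["CD4 T cells", "CD8 T cells", "Basophils"], name="Populations"),
)

# Write the table as csv (use a file path instead of the buffer in practice)
buffer = io.StringIO()
table.to_csv(buffer)

# Read it back: empty fields are parsed as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

print(reloaded)
```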

### Example

In [5]:
import pandas as pd

In [6]:
table = pd.read_csv("<path-to-csv>.csv", index_col=0)

In [7]:
table.head() # Display the first 5 rows of the table

| Populations | CD19 | CD4 | CD8 | CD34 | CD20 | CD45 | CD123 | CD11c | CD7 | CD16 | CD38 | CD3 | HLA-DR | CD64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basophils | -1 | NaN | -1.0 | -1 | -1.0 | NaN | 1 | -1 | -1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
| CD4 T cells | -1 | 1.0 | -1.0 | -1 | -1.0 | NaN | -1 | -1 | NaN | -1.0 | NaN | 1 | -1.0 | -1.0 |
| CD8 T cells | -1 | -1.0 | 1.0 | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | 1 | -1.0 | -1.0 |
| CD16- NK cells | -1 | NaN | NaN | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
| CD16+ NK cells | -1 | NaN | NaN | -1 | NaN | NaN | -1 | -1 | 1.0 | 1.0 | NaN | -1 | -1.0 | -1.0 |


You can see our [advice](../../advanced/advice) when creating this table.

Also, ensure your column names correspond to marker names in `adata.var_names`.
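
One way to check this is sketched below. The helper `find_unknown_markers` is hypothetical (it is not part of Scyan), and the plain list `var_names` stands in for `adata.var_names`.

```python
import pandas as pd


def find_unknown_markers(table, var_names):
    """Return the table columns that are absent from the marker names."""
    known = set(var_names)
    return [marker for marker in table.columns if marker not in known]


# Stand-in for adata.var_names (hypothetical panel)
var_names = ["CD3", "CD4", "CD8", "CD45"]

# Toy table with one misspelled marker name ("CD8a" instead of "CD8")
table = pd.DataFrame(
    {"CD3": [1, 1], "CD4": [1, -1], "CD8a": [-1, 1]},
    index=["CD4 T cells", "CD8 T cells"],
)

unknown = find_unknown_markers(table, var_names)
if unknown:
    print(f"Columns not found among the marker names: {', '.join(unknown)}")
```

With real data, you would call `find_unknown_markers(table, adata.var_names)` and rename or drop the reported columns before going further.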

## 4. (Optional) Save your data for later use

You can use [scyan.data.add](../../api/datasets/#scyan.data.add) to save your data.

In [9]:
scyan.data.add("your-project-name", adata, table)

INFO:scyan.data.datasets:Creating new dataset folder at /.../your_project_name
INFO:scyan.data.datasets:Created file /.../your_project_name/default.h5ad
INFO:scyan.data.datasets:Created file /.../your_project_name/default.csv


From now on, you can simply load your processed data with [scyan.data.load](../../api/datasets/#scyan.data.load):

In [10]:
adata, table = scyan.data.load("your-project-name")

### Congratulations! You can now follow our tutorial on [model training and visualisation](../../tutorials/usage).