# Preprocessing

This tutorial helps you preprocess your raw data so that you can run `Scyan` afterwards.

You'll learn how to:
- Create an `adata` object based on an FCS (or CSV) file and preprocess it.
- Create the knowledge table required for the annotation.
- (Optional) Compute a UMAP and save your dataset for later use.

Before continuing, make sure you have already [installed scyan](../../getting_started).

In [1]:
import scyan

Global seed set to 0


## 1. Creation of an `AnnData` object for your cytometry data

Consider reading the [anndata documentation](https://anndata.readthedocs.io/en/latest/) if you have never heard about `anndata` before (it's a nice library for handling single-cell data).

<div class="admonition note">
    <p class="admonition-title">Note</p>
    <p>
        Make sure you only keep the population of interest. E.g., if you are interested in immune cells, consider providing only the live cells that are CD45+. If this is not possible, continue the tutorial, but consider running Scyan to filter these cells before annotating the populations.
    </p>
</div>

### a) Loading a `FCS` or `CSV` file

You probably have `.fcs` or `.csv` files that you want to load. For this, you can use [`scyan.read_fcs`](../../api/io/#scyan.read_fcs) or [`scyan.read_csv`](../../api/io/#scyan.read_csv).

In [7]:
# If you have an FCS file
adata = scyan.read_fcs("<path-to-fcs>.fcs")

# If you have a CSV file
adata = scyan.read_csv("<path-to-csv>.csv")

print(f"Created anndata object with {adata.n_obs} cells and {adata.n_vars} markers.\n\n-> The markers names are: {', '.join(adata.var_names)}\n-> The non-marker names are: {', '.join(adata.obs.columns)}")

Created anndata object with 216331 cells and 42 markers.

-> The markers names are: epcam, CD4, CD38, CD1a, CD24, CD123, CD47, CD39, CD31, CD169, CCR7, CD44, CD141, CD1c, CD9, HLADQ, CD11b, CD103, CD3/16/9/20, CD366, PD1, CD21, CD127, GP38, CD14, CD45, CD206, CTLA4, CD207, CD223, PDL1, CD69, CD25, Siglec10, HLADR, FOLR2, CADM1, CD45RA, CD5, Via dye, CD88, CD8
-> The non-marker names are: Time, SSC-H, SSC-A, FSC-H, FSC-A, SSC-B-H, SSC-B-A, AF-A


### b) Sanity check

Make sure that the listed markers (i.e., `adata.var_names`) contain only protein markers, and that every other variable is inside `adata.obs`. If this is not the case, see the documentation of [`scyan.read_fcs`](../../api/io/#scyan.read_fcs) or [`scyan.read_csv`](../../api/io/#scyan.read_csv) for more advanced usage (e.g., you can update `marker_regex="^cd|^hla|epcam|^ccr"` to target all your markers).
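
To see why the regular expression may need adapting, here is a minimal sketch of how a pattern like the one above splits channel names into markers and other variables. The helper `split_by_marker_regex` is hypothetical and not part of `scyan`; case-insensitive matching is an assumption, so check the `scyan.read_fcs` documentation for the exact behavior.

```python
import re


def split_by_marker_regex(names, marker_regex):
    """Split channel names into (markers, others) using a regular expression.

    Hypothetical helper for illustration only; case-insensitive matching
    is an assumption, not necessarily what scyan does internally.
    """
    pattern = re.compile(marker_regex, flags=re.IGNORECASE)
    markers = [name for name in names if pattern.search(name)]
    others = [name for name in names if not pattern.search(name)]
    return markers, others


# A few channel names taken from the output shown earlier
names = ["epcam", "CD4", "HLADR", "CCR7", "PD1", "Time", "FSC-A"]
markers, others = split_by_marker_regex(names, "^cd|^hla|epcam|^ccr")

print("Markers:", markers)
print("Others:", others)
```

In this sketch, `PD1` is not captured by the pattern, so you would extend the regular expression (e.g., with `|^pd`, under the same assumption) before loading your data.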

### c) Concatenate your data (optional)

If you have multiple `FCS` files, consider [concatenating your data](https://anndata.readthedocs.io/en/latest/generated/anndata.concat.html). We advise adding an observation column such as "batch" or "patient_id" to keep track of the batch / patient ID.

<details class="tip">
    <summary>Click to show an example</summary>
    <p>This short script concatenates all the FCS files inside a specific folder and saves each file name into <code>adata.obs["file"]</code> so that we don't lose information. You can add additional information, e.g., in <code>adata.obs["batch"]</code> if you have different batches.</p>
    <div class="highlight"><pre><span></span><code><a id="__codelineno-6-1" name="__codelineno-6-1" href="#__codelineno-6-1"></a><span class="kn">import</span> <span class="nn">anndata</span>
<a id="__codelineno-6-2" name="__codelineno-6-2" href="#__codelineno-6-2"></a><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<a id="__codelineno-6-3" name="__codelineno-6-3" href="#__codelineno-6-3"></a>
<a id="__codelineno-6-4" name="__codelineno-6-4" href="#__codelineno-6-4"></a><span class="n">folder_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&quot;.&quot;</span><span class="p">)</span> <span class="c1"># Replace &quot;.&quot; by the path to your folder containing FCS files</span>
<a id="__codelineno-6-5" name="__codelineno-6-5" href="#__codelineno-6-5"></a><span class="n">fcs_paths</span> <span class="o">=</span> <span class="p">[</span><span class="n">path</span> <span class="k">for</span> <span class="n">path</span> <span class="ow">in</span> <span class="n">folder_path</span><span class="o">.</span><span class="n">iterdir</span><span class="p">()</span> <span class="k">if</span> <span class="n">path</span><span class="o">.</span><span class="n">suffix</span> <span class="o">==</span> <span class="s2">&quot;.fcs&quot;</span><span class="p">]</span>
<a id="__codelineno-6-6" name="__codelineno-6-6" href="#__codelineno-6-6"></a>
<a id="__codelineno-6-7" name="__codelineno-6-7" href="#__codelineno-6-7"></a><span class="k">def</span> <span class="nf">read_one</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
<a id="__codelineno-6-8" name="__codelineno-6-8" href="#__codelineno-6-8"></a>    <span class="n">adata</span> <span class="o">=</span> <span class="n">scyan</span><span class="o">.</span><span class="n">read_fcs</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
<a id="__codelineno-6-9" name="__codelineno-6-9" href="#__codelineno-6-9"></a>    <span class="n">adata</span><span class="o">.</span><span class="n">obs</span><span class="p">[</span><span class="s2">&quot;file&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">path</span><span class="o">.</span><span class="n">stem</span>
<a id="__codelineno-6-10" name="__codelineno-6-10" href="#__codelineno-6-10"></a>    <span class="n">adata</span><span class="o">.</span><span class="n">obs</span><span class="p">[</span><span class="s2">&quot;batch&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;NA&quot;</span> <span class="c1"># If you have batches, add here the batch of the corresponding path</span>
<a id="__codelineno-6-11" name="__codelineno-6-11" href="#__codelineno-6-11"></a>    <span class="k">return</span> <span class="n">adata</span>
<a id="__codelineno-6-12" name="__codelineno-6-12" href="#__codelineno-6-12"></a>
<a id="__codelineno-6-13" name="__codelineno-6-13" href="#__codelineno-6-13"></a><span class="n">adata</span> <span class="o">=</span> <span class="n">anndata</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">read_one</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">fcs_paths</span><span class="p">],</span> <span class="n">index_unique</span><span class="o">=</span><span class="s2">&quot;-&quot;</span><span class="p">)</span>
</code></pre></div>
</details>
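
If your files belong to different batches, one possible way to fill the batch column is a lookup from the file name. The sketch below is only an illustration: the mapping `BATCH_BY_FILE`, the file names, and the helper `batch_of` are all hypothetical and should be replaced by your own metadata.

```python
from pathlib import Path

# Hypothetical mapping from file stem to batch; replace with your own metadata
BATCH_BY_FILE = {
    "patient1_day0": "batch_A",
    "patient2_day0": "batch_B",
}


def batch_of(path, default="NA"):
    """Return the batch associated with a file path, or `default` if unknown."""
    return BATCH_BY_FILE.get(Path(path).stem, default)


print(batch_of("data/patient1_day0.fcs"))
print(batch_of("data/unlisted_file.fcs"))
```

Such a helper could then be used when reading each file, e.g., `adata.obs["batch"] = batch_of(path)`.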

### d) Preprocessing

Choose either the `asinh` or `logicle` transformation below, and scale your data.

In [4]:
is_cytof = True

if is_cytof: # we recommend asinh for CyTOF data
    scyan.preprocess.asinh_transform(adata)
else: # we recommend auto_logicle for flow or spectral flow
    scyan.preprocess.auto_logicle_transform(adata)

scyan.preprocess.scale(adata)
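
For intuition, the `asinh` transformation is commonly written as `asinh(x / cofactor)`, which is roughly linear near zero and logarithmic for large intensities. The sketch below uses plain Python with a cofactor of 5 (a common choice for CyTOF data). The function name and its default are illustrative assumptions, and `scyan.preprocess.asinh_transform` may use different parameters.

```python
import math


def asinh_sketch(values, cofactor=5.0):
    """Apply asinh(x / cofactor) to each raw intensity.

    Illustrative only; the cofactor value is an assumption and
    may differ from the defaults used by scyan.
    """
    return [math.asinh(value / cofactor) for value in values]


raw_intensities = [0.0, 5.0, 50.0, 500.0]
transformed = asinh_sketch(raw_intensities)

# Each tenfold increase in intensity adds a roughly constant amount
for raw, new in zip(raw_intensities, transformed):
    print(f"{raw:>6} -> {new:.3f}")
```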

## 2. Creation of the knowledge table

<div class="admonition note">
    <p class="admonition-title">Note</p>
    <p>
        Some existing tables can be found <a href="https://github.com/MICS-Lab/scyan_data/blob/main/public_tables.md">here</a>. They may help you build your own table.
    </p>
</div>

The knowledge table contains well-known marker expressions per population. For instance, if you want `Scyan` to annotate CD4 T cells, you have to tell which markers CD4 T cells are supposed to express or not. Depending on your panel, it may be CD4+, CD8-, CD45+, CD3+, etc. Values inside the table can be:

- `-1` for negative expressions.
- `1` for positive expressions.
- `NA` when you don't know or when it is not applicable (if you use a CSV, you can also leave the field empty; it will be read as `NaN` by `pandas`).
- Float values such as `0` or `-0.5` for mid and low expressions, respectively (use them only when necessary).

Each row corresponds to one population, and each column corresponds to one marker (i.e., one of `adata.var_names`).

You can either create a `csv` directly, or use Excel and export the table as `csv`. Then, you can import the `csv` to make a pandas `DataFrame`.
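
If you prefer to write the `csv` from a script, the following minimal sketch builds one with the standard library. The populations, markers, and expression values are purely illustrative, and the helper `knowledge_table_csv` is hypothetical (not part of `scyan`). Unknown values are left empty, so that `pandas` reads them as `NaN`.

```python
import csv
import io

markers = ["CD3", "CD4", "CD8", "CD19"]

# Illustrative expressions: 1 = positive, -1 = negative, absent = unknown / not applicable
populations = {
    "CD4 T cells": {"CD3": 1, "CD4": 1, "CD8": -1, "CD19": -1},
    "CD8 T cells": {"CD3": 1, "CD4": -1, "CD8": 1, "CD19": -1},
    "B cells": {"CD3": -1, "CD19": 1},
}


def knowledge_table_csv(populations, markers):
    """Return CSV text with one row per population and one column per marker."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Populations", *markers])
    for name, expressions in populations.items():
        values = [expressions.get(marker) for marker in markers]
        # Missing values become empty fields
        writer.writerow([name, *("" if value is None else value for value in values)])
    return buffer.getvalue()


csv_text = knowledge_table_csv(populations, markers)
print(csv_text)
```

You could then write `csv_text` to a file and load that file with `pd.read_csv(..., index_col=0)`.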

### Example

In [5]:
import pandas as pd

In [6]:
table = pd.read_csv("<path-to-csv>.csv", index_col=0)

In [7]:
table.head() # Display the first 5 rows of the table

| Populations | CD19 | CD4 | CD8 | CD34 | CD20 | CD45 | CD123 | CD11c | CD7 | CD16 | CD38 | CD3 | HLA-DR | CD64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basophils | -1 | NaN | -1.0 | -1 | -1.0 | NaN | 1 | -1 | -1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
| CD4 T cells | -1 | 1.0 | -1.0 | -1 | -1.0 | NaN | -1 | -1 | NaN | -1.0 | NaN | 1 | -1.0 | -1.0 |
| CD8 T cells | -1 | -1.0 | 1.0 | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | 1 | -1.0 | -1.0 |
| CD16- NK cells | -1 | NaN | NaN | -1 | -1.0 | NaN | -1 | -1 | 1.0 | -1.0 | NaN | -1 | -1.0 | -1.0 |
| CD16+ NK cells | -1 | NaN | NaN | -1 | NaN | NaN | -1 | -1 | 1.0 | 1.0 | NaN | -1 | -1.0 | -1.0 |


See our [advice](../../advice) for creating this table.

### Sanity check

Make sure `table.index` contains population names, and that `table.columns` contains existing marker names (i.e., included in `adata.var_names`).
> NB: the table index can be a `MultiIndex` to list hierarchical populations, and the first level should correspond to the most precise populations (see [how to work with hierarchical populations](../usage/#working-with-hierarchical-populations)).
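
A quick way to run this check is to list the table columns that are missing from the marker names. The helper `missing_markers` below is hypothetical (not part of `scyan`) and works on any two collections of names. The example names are illustrative.

```python
def missing_markers(table_columns, var_names):
    """Return the table columns that do not appear among the marker names."""
    known = set(var_names)
    return [marker for marker in table_columns if marker not in known]


# Illustrative names: "HLA-DR" in the table does not match "HLADR" in the data
table_columns = ["CD4", "CD8", "HLA-DR"]
var_names = ["CD4", "CD8", "HLADR", "CD45"]

print(missing_markers(table_columns, var_names))
```

With real data, you could call `missing_markers(table.columns, adata.var_names)` and expect an empty list. Any name it returns should be renamed in the table or in the data so that both spellings agree.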

## 3. (Optional) Compute a UMAP
You can compute the UMAP coordinates using [`scyan.tools.umap`](../../api/representation/#scyan.tools.umap). The API documentation describes its usage; in particular, you can compute the UMAP on a specific set of markers, or on a subset of cells (to speed up the computation).

> Note that it only computes the coordinates; to display them, use [`scyan.plot.umap`](../../api/plots/#scyan.plot.umap).

In [None]:
# Option 1: Use all markers to compute the UMAP
scyan.tools.umap(adata)

# Option 2: Use only the cell-type markers (recommended), or choose your own list of markers
scyan.tools.umap(adata, markers=table.columns)

## 4. (Optional) Save your data for later use

You can use [scyan.data.add](../../api/datasets/#scyan.data.add) to save your data.

In [9]:
scyan.data.add("your-project-name", adata, table)

INFO:scyan.data.datasets:Creating new dataset folder at /.../your_project_name
INFO:scyan.data.datasets:Created file /.../your_project_name/default.h5ad
INFO:scyan.data.datasets:Created file /.../your_project_name/default.csv


From now on, you can simply load your processed data with [scyan.data.load](../../api/datasets/#scyan.data.load):

In [10]:
adata, table = scyan.data.load("your-project-name")

## Next steps
Congratulations! You can now follow our tutorial on [model training and visualization](../../tutorials/usage).