# Launch AGEAS with ageas.Launch()

This notebook demonstrates how to use ageas.Launch() to extract key genomic features from Gene Expression Matrices (GEMs) containing RNA-seq based gene count data of different sample groups.


In [None]:
import ageas


Currently, AGEAS supports data in two formats:

1. A dataframe in CSV or TXT format with rows representing genes and columns representing samples, which should look like:



   |                 | SRR1039509 | SRR1039512 | SRR1039513 | SRR1039516 | SRR1039508 |
   |-----------------|------------|------------|------------|------------|------------|
   | ENSG00000000003 | 679        | 448        | 873        | 408        | 1138       |
   | ENSG00000000005 | 0          | 0          | 0          | 0          | 0          |
   | ENSG00000000419 | 467        | 515        | 621        | 365        | 587        |
   | ENSG00000000457 | 260        | 211        | 263        | 164        | 245        |
   | ENSG00000000460 | 60         | 55         | 40         | 35         | 78         |
   | ENSG00000000938 | 0          | 0          | 2          | 0          | 1          |

   

   Genes must be named with either Gene Symbols or Ensembl Gene IDs.

   There is no requirement for sample names: barcodes, numbers, or any arbitrary names will work.


2. Market Exchange Format (MEX) output by the [cellranger](https://github.com/10XGenomics/cellranger) pipeline. For more information, see:
   https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices
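As a quick sanity check before launching, the layout described for format 1 can be verified with a few lines of standard-library Python. This is a minimal sketch and not part of AGEAS: `check_gem_layout` and `ENSEMBL_GENE_ID` are hypothetical names, and the pattern assumes Ensembl gene IDs of the form `ENSG` or `ENSMUSG` followed by 11 digits.

```python
import csv
import io
import re

# Assumed shape of an Ensembl gene ID: species prefix, 'G', then 11 digits.
ENSEMBL_GENE_ID = re.compile(r'^ENS[A-Z]*G\d{11}$')

def check_gem_layout(csv_text):
    """Hypothetical helper: return (sample_names, gene_names) or raise ValueError."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    sample_names = header[1:]  # the first header cell sits above the gene column
    gene_names = []
    for row in body:
        if len(row) != len(header):
            raise ValueError(f'gene {row[0]!r} does not have one value per sample')
        for value in row[1:]:
            float(value)  # raises ValueError when a count is not numeric
        gene_names.append(row[0])
    return sample_names, gene_names

example = """,SRR1039509,SRR1039512,SRR1039513
ENSG00000000003,679,448,873
ENSG00000000005,0,0,0
ENSG00000000419,467,515,621
"""

samples, genes = check_gem_layout(example)
all_ensembl = all(ENSEMBL_GENE_ID.match(gene) for gene in genes)
print(samples)
print(genes, all_ensembl)
```

A matrix that uses Gene Symbols instead would simply give `all_ensembl == False`, which is fine as long as the naming is consistent.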

## CSV example:

This example uses AGEAS to extract key genomic factors for reprogramming Mouse Embryonic Fibroblasts (MEFs) into Induced Pluripotent Stem Cells (iPSCs), one of the best-known cell reprogramming cases.

Here, we can use scRNA-seq data published as [GSE103221](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103221).

Either the raw data in GSE103221_RAW.tar or the normalized counts in GSE103221_normalized_counts.csv.gz can be processed with AGEAS.

To use the raw data with AGEAS default settings:

In [None]:
# This part is computationally very expensive!
# Feel free to skip to the adjusted version below.

ageas.Launch(
	group1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
	group2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',
)

A few adjustments can be made to the ageas.Launch() pipeline, for example:

In [None]:
test_raw = ageas.Launch(
	mute_unit = True,
	protocol = 'multi',
	unit_num = 4,

	# ageas.Get_Pseudo_Samples() args
	group1_path = 'GSE103221_RAW/GSM3629847_10x_osk_mef.csv.gz',
	group2_path = 'GSE103221_RAW/GSM3629848_10x_osk_esc.csv.gz',

	# ageas.Unit() args
	std_value_thread = 3.0,
)

Here, 4 AGEAS extractor units are used with **_unit_num = 4_** instead of the default 2, so the extraction result should generalize better.

To save running time, **_protocol = 'multi'_** sets the units to run in parallel with multithreading.

With **_mute_unit = True_**, which is also the default setting, no log is printed by any unit.

**_std_value_thread = 3.0_** rules out genes with relatively low expression variability and thus limits the number of Gene Regulatory Pathways (GRPs) in the meta-level processed Gene Regulatory Network (GRN) and in the pseudo-sample GRNs.
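To make the variability filter concrete, the sketch below drops genes whose standard deviation across samples falls below a threshold. It only illustrates the idea and is not AGEAS's internal code: `filter_low_variability` is a hypothetical helper, and both the use of the population standard deviation and the `>=` comparison are assumptions. The counts are taken from the example table above.

```python
import statistics

def filter_low_variability(expression, std_threshold):
    """Hypothetical helper: keep genes whose std across samples reaches the threshold."""
    return {
        gene: counts
        for gene, counts in expression.items()
        if statistics.pstdev(counts) >= std_threshold
    }

# Rows of the example gene expression matrix (gene -> counts per sample).
expression = {
    'ENSG00000000003': [679, 448, 873, 408, 1138],
    'ENSG00000000005': [0, 0, 0, 0, 0],
    'ENSG00000000419': [467, 515, 621, 365, 587],
    'ENSG00000000457': [260, 211, 263, 164, 245],
    'ENSG00000000460': [60, 55, 40, 35, 78],
    'ENSG00000000938': [0, 0, 2, 0, 1],
}

kept = filter_low_variability(expression, std_threshold=3.0)
print(sorted(kept))
```

With a threshold of 3.0, the two genes that are almost never expressed (`ENSG00000000005` and `ENSG00000000938`) are dropped, so fewer candidate gene pairs remain for the later steps.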

For more API information, please visit the [documentation page](https://nkmtmsys.github.io/Ageas/html/generated/ageas.Launch.html#ageas.Launch).

Extraction reports can be saved as files with:

In [None]:
test_raw.save_reports(
	folder_path = 'report_files/',
	save_unit_reports = True
)

Within the folder *report_files*, there should be the following files:
```bash
report_files/
    │
    ├─ no_1/
    │  ├─ grps_importances.txt
    │  ├─ outlier_grps.js
    │
    ├─ no_2/
    │
    ├─ no_3/
    │
    ├─ no_4/
    │
    ├─ key_atlas.js
    │
    ├─ metaGRN.js
    │
    ├─ meta_report.csv
    │
    ├─ psGRNs.js
    │
    ├─ report.csv
```

Folders ***no_1***, ***no_2***, ***no_3***, and ***no_4*** contain the key Gene Regulatory Pathways (GRPs) extracted by each extractor unit: ***grps_importances.txt*** lists GRPs ranked by importance score, and ***outlier_grps.js*** lists GRPs that were removed during feature selection because of an extremely high importance score. If this information is not needed, leave **_save_unit_reports_** at its default of False.

***metaGRN.js*** contains the meta-level processed Gene Regulatory Network (GRN) cast with all data in each sample group.

***psGRNs.js*** contains the GRNs cast with each pseudo-sample.

***key_atlas.js*** contains regulons cast with all of the key GRPs extracted, plus bridge GRPs that can connect separate regulons to form a larger network.

***meta_report.csv*** is generated from the meta-GRN only. It should look like:


| ID     | Gene Symbol | Type | Degree | Log2FC           |
|--------|-------------|------|--------|------------------|
| Pou5f1 | Pou5f1      | TF   | 786    | 18.0654266535883 |
| Trim28 | Trim28      | TF   | 727    | 16.7739633684336 |
| Trp53  | Trp53       | TF   | 725    | 15.9708922902521 |
| Rest   | Rest        | TF   | 695    | 15.0240141813129 |
| Sox2   | Sox2        | TF   | 687    | 15.4459524481466 |
| Junb   | Junb        | TF   | 683    | 14.1196706687477 |
| Cebpb  | Cebpb       | TF   | 682    | 13.8599229728618 |


***report.csv*** is generated from ***key_atlas.js***. It should look like:


| ID     | Gene Symbol | Regulon   | Type | Source_Num | Target_Num | Meta_Degree  | Log2FC             |
|--------|-------------|-----------|------|------------|------------|--------------|--------------------|
| Pou5f1 | Pou5f1      | regulon_0 | TF   | 5          | 68         | 786          | 18.065426653588275 |
| Klf2   | Klf2        | regulon_0 | TF   | 0          | 71         | 592          | 17.77230812962195  |
| Trim28 | Trim28      | regulon_0 | TF   | 8          | 136        | 727          | 16.77396336843356  |
| Trp53  | Trp53       | regulon_0 | TF   | 7          | 133        | 724          | 15.97089229025213  |
| Sox2   | Sox2        | regulon_0 | TF   | 3          | 83         | 687          | 15.44595244814661  |
| Nanog  | Nanog       | regulon_0 | TF   | 4          | 81         | 656          | 15.417885470810674 |
| Klf4   | Klf4        | regulon_0 | TF   | 3          | 62         | 428          | 15.348589852448784 |
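A report like this can be ranked and filtered with a few lines of Python. The sketch below is only an illustration: it assumes ***report.csv*** is comma-separated with the column names shown in the table, `top_regulators` is a hypothetical helper, and the rows are copied from the table into an in-memory string so the example runs without the real file.

```python
import csv
import io

# Rows copied from the example report.csv table above.
report_text = """ID,Gene Symbol,Regulon,Type,Source_Num,Target_Num,Meta_Degree,Log2FC
Pou5f1,Pou5f1,regulon_0,TF,5,68,786,18.065426653588275
Klf2,Klf2,regulon_0,TF,0,71,592,17.77230812962195
Trim28,Trim28,regulon_0,TF,8,136,727,16.77396336843356
Trp53,Trp53,regulon_0,TF,7,133,724,15.97089229025213
Sox2,Sox2,regulon_0,TF,3,83,687,15.44595244814661
Nanog,Nanog,regulon_0,TF,4,81,656,15.417885470810674
Klf4,Klf4,regulon_0,TF,3,62,428,15.348589852448784
"""

def top_regulators(report_file, n):
    """Hypothetical helper: the n transcription factors with the most targets."""
    rows = [row for row in csv.DictReader(report_file) if row['Type'] == 'TF']
    rows.sort(key=lambda row: int(row['Target_Num']), reverse=True)
    return [(row['Gene Symbol'], int(row['Target_Num'])) for row in rows[:n]]

top = top_regulators(io.StringIO(report_text), 3)
print(top)  # [('Trim28', 136), ('Trp53', 133), ('Sox2', 83)]
```

For a real run, the in-memory string would be replaced with a file handle such as `open('report_files/report.csv', newline='')`, assuming the file was saved at that path.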


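The normalized counts can be processed as well. First split the matrix into MEF and ESC sample groups by column name, then pass the two resulting files to ageas.Launch():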

In [None]:
import re
import pandas as pd

# Genes are rows (index), samples are columns
data = pd.read_csv('GSE103221_normalized_counts.csv', index_col = 0)

# Pick sample columns by the group tag in their names
mef_samples = [x for x in data if re.search(r'mef', x)]
esc_samples = [x for x in data if re.search(r'esc', x)]

# One gzip-compressed CSV per sample group
data[mef_samples].to_csv('mef.csv.gz')
data[esc_samples].to_csv('esc.csv.gz')

In [None]:
test_normalized = ageas.Launch(
	unit_num = 4,

	group1_path = 'mef.csv.gz',
	group2_path = 'esc.csv.gz',
	sliding_window_size = 10,
	sliding_window_stride = 1,

	log2fc_thread = 3,
	std_value_thread = 100,
)

test_normalized.save_reports(
	folder_path = 'report_files/',
	save_unit_reports = True
)
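The **_log2fc_thread = 3_** argument in the call above is not described in this notebook. Going by its name and by the Log2FC column in the reports, it presumably sets a threshold on the log2 fold change of expression between the two sample groups. The sketch below shows what such a filter could look like. It is an assumption, not AGEAS's implementation: `log2_fold_change`, `passes_log2fc`, the pseudo-count of 1, and the gene values are all hypothetical.

```python
import math

def log2_fold_change(group1_values, group2_values, pseudo_count=1.0):
    """Log2 ratio of mean expression; the pseudo-count avoids division by zero."""
    mean1 = sum(group1_values) / len(group1_values)
    mean2 = sum(group2_values) / len(group2_values)
    return math.log2((mean1 + pseudo_count) / (mean2 + pseudo_count))

def passes_log2fc(group1_values, group2_values, threshold):
    """Hypothetical filter: keep a gene only if |log2FC| reaches the threshold."""
    return abs(log2_fold_change(group1_values, group2_values)) >= threshold

# Made-up expression values for two genes in the two sample groups.
strongly_changed = passes_log2fc([0, 0, 0], [7, 7, 7], threshold=3)  # log2(1/8) = -3
barely_changed = passes_log2fc([3, 3], [1, 1], threshold=3)          # log2(4/2) = 1
print(strongly_changed, barely_changed)  # True False
```

Under this reading, a higher threshold keeps only genes that differ strongly between the groups.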

Each group in the raw data contains a few thousand samples, so the default setting of generating one pseudo-sample from every 100 samples is acceptable. A pseudo-sample abstracts gene expression from several distinct samples into continuous expression data so that gene expression correlations can be calculated.
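The **_sliding_window_size_** and **_sliding_window_stride_** arguments in the call above suggest that pseudo-samples are drawn as windows over the sample columns. The sketch below illustrates that idea under this assumption. `sliding_windows` and the sample names are hypothetical, and AGEAS may group samples differently internally.

```python
def sliding_windows(sample_names, window_size, stride):
    """Hypothetical helper: list the samples that fall into each window."""
    last_start = len(sample_names) - window_size
    return [
        sample_names[start:start + window_size]
        for start in range(0, last_start + 1, stride)
    ]

samples = [f'cell_{i}' for i in range(12)]
windows = sliding_windows(samples, window_size=10, stride=1)

print(len(windows))                    # 3
print(windows[1][0], windows[1][-1])   # cell_1 cell_10
```

With a stride of 1, consecutive windows overlap in all but one sample, so a small group still yields several pseudo-samples. A stride equal to the window size would give non-overlapping windows instead.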

