# Data Downloading



## 0. Install the omics_toolbox conda environment

If you have not already set up the `omics_toolbox` environment, you can create it using the provided `environment.yml` file. Use the following command in your terminal:

`conda env create -f environment.yml`
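Once the environment exists, it can help to confirm that the notebook kernel is actually running inside it. The check below is a minimal sketch, assuming the environment defined in `environment.yml` is named `omics_toolbox` and was created as a regular named conda environment; `running_in_env` is a hypothetical helper, not part of the toolbox.

```python
import sys
from pathlib import Path


def running_in_env(env_name, prefix=sys.prefix):
    """Return True if the interpreter prefix is a directory named after the env."""
    # Named conda environments are stored in a directory that carries the
    # environment name, e.g. <conda root>/envs/omics_toolbox.
    return Path(prefix).name == env_name


# Assumption: the environment is called "omics_toolbox".
if running_in_env("omics_toolbox"):
    print("Kernel is running inside omics_toolbox")
else:
    print(f"Kernel prefix is {sys.prefix}; activate omics_toolbox and restart the kernel")
```

If the check fails, activating the environment with `conda activate omics_toolbox` before starting Jupyter is usually enough.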

## 1. Place your files into the `data/raw` directory

Your data can be in one of several formats. The "classic" one is the set of Gene-Barcode matrix files (`barcodes.tsv`, `genes.tsv` or `features.tsv`, and `matrix.mtx`).

Notebook `1-data-preprocessing` explains how to preprocess other types of data. Essentially, all you need is a data matrix.

If you don't have a dataset to try the algorithm on, section 2 below explains how to download the Embryoid Body (EB) dataset from Mendeley Datasets.

The next notebook assumes that your data is located at:
```bash 
data/raw/{{DATA_NAME}}
```
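The path convention above can be sketched in Python as follows. This is only an illustration: `DEFAULT_RAW_DIR`, `dataset_dir`, and `list_datasets` are hypothetical names, and the relative path `../../data/raw` is taken from the cells in this notebook.

```python
from pathlib import Path

# Assumption: the notebook lives two levels below the project root,
# as the `../../data/raw` paths used in this notebook suggest.
DEFAULT_RAW_DIR = Path("../../data/raw")


def dataset_dir(data_name, raw_dir=DEFAULT_RAW_DIR):
    """Build the path data/raw/<DATA_NAME> for one dataset."""
    return Path(raw_dir) / data_name


def list_datasets(raw_dir=DEFAULT_RAW_DIR):
    """Return the names of all dataset folders found under raw_dir."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        return []  # nothing to list if the directory does not exist yet
    return sorted(p.name for p in raw_dir.iterdir() if p.is_dir())


print(dataset_dir("scRNAseq"))
print(list_datasets())
```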
Running the `tree` command shows the file structure.

In [6]:
# Verify your file structure is correct
!tree ../../data/raw

../../data/raw
├── cyclicEMT
│   ├── DN48_Output_Table_20241008.parquet
│   ├── interest_columns
│   │   ├── mean_intensity_columns.txt
│   │   ├── median_intensity_columns.txt
│   │   ├── std_intensity_columns.txt
│   │   └── upperquartile_intensity_columns.txt
│   └── interest_columns.zip
├── germ_cells
│   ├── 2nd_vc2010
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
│   └── vc2010_1
│       ├── barcodes.tsv
│       ├── features.tsv
│       └── matrix.mtx
└── scRNAseq
    ├── T0_1A
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T2_3B
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T4_5C
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T6_7D
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    └── T8_9E
        ├── barc

<a id='loading'></a>
## 2. [Optional] Downloading 10X data

### Downloading Data from Mendeley Datasets

The EB dataset is publicly available as `scRNAseq.zip` on Mendeley Datasets at <https://data.mendeley.com/datasets/v6n743h5ng/>.

Inside the scRNAseq folder, there are five subdirectories, and in each subdirectory are three files: `barcodes.tsv`, `genes.tsv`, and `matrix.mtx`. For more information about how CellRanger produces these files, check out the [Gene-Barcode Matrices Documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices).
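Because the three files describe one matrix, their sizes must agree: in this layout the rows of `matrix.mtx` correspond to the lines of `genes.tsv` and the columns to the lines of `barcodes.tsv`. The following is a minimal sketch of such a consistency check using only the standard library; `read_mtx_shape`, `count_lines`, and `check_sample` are hypothetical helpers, and the check assumes uncompressed files without header rows in the `.tsv` files.

```python
from pathlib import Path


def read_mtx_shape(mtx_path):
    """Return (n_rows, n_cols, n_nonzero) from a Matrix Market coordinate file."""
    with open(mtx_path) as handle:
        for line in handle:
            # Lines starting with '%' are the format banner and comments.
            if line.startswith("%") or not line.strip():
                continue
            # The first remaining line is the size line: rows, columns, entries.
            n_rows, n_cols, n_nonzero = (int(x) for x in line.split())
            return n_rows, n_cols, n_nonzero
    raise ValueError(f"No size line found in {mtx_path}")


def count_lines(path):
    """Count the non-empty lines of a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_sample(sample_dir):
    """Return True if matrix.mtx agrees with genes.tsv and barcodes.tsv."""
    sample_dir = Path(sample_dir)
    n_rows, n_cols, _ = read_mtx_shape(sample_dir / "matrix.mtx")
    n_genes = count_lines(sample_dir / "genes.tsv")
    n_barcodes = count_lines(sample_dir / "barcodes.tsv")
    return n_rows == n_genes and n_cols == n_barcodes
```

For example, `check_sample("../../data/raw/scRNAseq/T0_1A")` should return `True` for an intact download.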

Make sure to download the files and save them under the `omics_toolbox/data/raw` directory.

Here's the directory structure:
```
data
├── raw
│   └── scRNAseq
│       ├── scRNAseq.zip
│       ├── T0_1A
│       │   ├── barcodes.tsv
│       │   ├── genes.tsv
│       │   └── matrix.mtx
│       ├── T2_3B
│       │   ├── barcodes.tsv
│       │   ├── genes.tsv
│       │   └── matrix.mtx
│       ├── T4_5C
│       │   ├── barcodes.tsv
│       │   ├── genes.tsv
│       │   └── matrix.mtx
│       ├── T6_7D
│       │   ├── barcodes.tsv
│       │   ├── genes.tsv
│       │   └── matrix.mtx
│       └── T8_9E
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── processed
├── interim
└── external
```
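Extracting the archive and checking the result against the structure above could look roughly like this. It is a sketch under assumptions: the archive has already been downloaded manually, its internal folder layout is not known here (so inspect the returned member names before choosing the destination), and `extract_archive` and `missing_files` are hypothetical helpers.

```python
import zipfile
from pathlib import Path

# The three files each sample folder is expected to contain.
EXPECTED_FILES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Map each sample folder to the expected files it lacks (empty dict if complete)."""
    problems = {}
    for sample in sorted(p for p in Path(dataset_dir).iterdir() if p.is_dir()):
        missing = [name for name in expected if not (sample / name).is_file()]
        if missing:
            problems[sample.name] = missing
    return problems
```

A call such as `missing_files("../../data/raw/scRNAseq")` returning `{}` indicates that every sample folder contains all three files.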



In [None]:
import os
RAW_DATA_DIR = os.path.join('../../data', 'raw')

print(RAW_DATA_DIR)

../../data/raw


You can verify that the contents of `data/raw` match the structure above:

In [4]:
!tree ../../data/raw

../../data/raw
├── cyclicEMT
│   ├── DN48_Output_Table_20241008.parquet
│   ├── interest_columns
│   │   ├── mean_intensity_columns.txt
│   │   ├── median_intensity_columns.txt
│   │   ├── std_intensity_columns.txt
│   │   └── upperquartile_intensity_columns.txt
│   └── interest_columns.zip
├── germ_cells
│   ├── 2nd_vc2010
│   │   ├── barcodes.tsv
│   │   ├── features.tsv
│   │   └── matrix.mtx
│   └── vc2010_1
│       ├── barcodes.tsv
│       ├── features.tsv
│       └── matrix.mtx
└── scRNAseq
    ├── T0_1A
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T2_3B
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T4_5C
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    ├── T6_7D
    │   ├── barcodes.tsv
    │   ├── genes.tsv
    │   └── matrix.mtx
    └── T8_9E
        ├── barc