#### How to Download a New Dataset from the GEO Portal

1. Find the Dataset

    1.1 Go to the GEO website (https://www.ncbi.nlm.nih.gov/geo/) and locate the dataset you wish to download. 
        Example: GSE189849

    1.2 Find the matching metadata table:
        Click the 'SRA Run Selector' link at the bottom left of the dataset page, then download the table with the 'Metadata' button in the Run Selector.

        This table links each run (SRR ID) to its sample and treatment group, and may include additional information such as age and sex.
        
        Example metadata table location: /hdd1/projects/cd206/GSE189849/SraRunTable.csv


2. Create an info file for multiple sample downloads

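    2.1 Create a new tab-separated text file whose first column lists the SRR IDs you wish to download. Name this column Run in the header row.

    2.2 Add a second column specifying the group name for each sample. Name this column group.

    2.3 Save the file to the desired location. The loops in the following steps skip the header row and read the SRR ID from the first tab-separated column.

    Example: /hdd1/projects/cd206/GSE189849/sample.txt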
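
    A minimal sketch of steps 2.1 to 2.3, assuming the metadata table is a plain comma-separated file with no quoted commas inside fields, and that the group information sits in a column named treatment (a hypothetical name; use whichever column of your SraRunTable.csv holds the groups). The helper name make_sample_file is illustrative only. It writes a tab-separated file with a header row, which is the layout the loops in the later steps read.

```shell
# make_sample_file METADATA_CSV GROUP_COLUMN OUTPUT_TSV
# Extracts the Run column and one grouping column from a comma-separated
# metadata table and writes them as a tab-separated file with the header
# "Run<TAB>group". Assumption: no field contains a quoted comma.
make_sample_file() {
    metadata_csv=$1
    group_column=$2
    output_tsv=$3
    awk -F',' -v group_col="$group_column" '
        { sub(/\r$/, "") }                  # tolerate Windows line endings
        NR == 1 {
            # Locate the two columns of interest by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run_idx = i
                if ($i == group_col) group_idx = i
            }
            if (!run_idx || !group_idx) {
                print "missing Run or " group_col " column" > "/dev/stderr"
                exit 1
            }
            print "Run\tgroup"
            next
        }
        { print $run_idx "\t" $group_idx }
    ' "$metadata_csv" > "$output_tsv"
}
```

    Example call (paths taken from the examples above, column name assumed): make_sample_file /hdd1/projects/cd206/GSE189849/SraRunTable.csv treatment /hdd1/projects/cd206/GSE189849/sample.txt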

3. Download SRA files using prefetch

    3.1 Download a single SRA file:


In [None]:
/home/elik/sratoolkit.3.0.0-centos_linux64/bin/prefetch SRR123456 &

    3.2 Download multiple SRA files:


In [None]:
tail -n +2 /PATH/TO/SRR/FILE/PRJNA12345.txt | while IFS=$'\t' read -r i _; do /home/elik/sratoolkit.3.0.0-centos_linux64/bin/prefetch "$i"; done &

    3.3 The downloaded SRA files will be located at: /hdd1/ncbi/sra/
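
    Because the downloads run in the background, it can help to confirm that every run in the sample file actually arrived before converting. The following is a sketch only: list_missing_sra is a hypothetical helper, and it assumes each run is stored as a single <ID>.sra file directly inside the download directory, as the paths in this guide suggest.

```shell
# list_missing_sra SAMPLE_FILE SRA_DIR
# Prints every run ID from the first column of SAMPLE_FILE (header row
# skipped) that has no matching, non-empty SRA_DIR/<ID>.sra file.
list_missing_sra() {
    sample_file=$1
    sra_dir=$2
    tail -n +2 "$sample_file" | cut -f1 | while read -r run; do
        [ -n "$run" ] || continue                 # skip blank lines
        [ -s "$sra_dir/$run.sra" ] || echo "$run"
    done
}
```

    Example call: list_missing_sra /PATH/TO/SRR/FILE/PRJNA12345.txt /hdd1/ncbi/sra
    An empty result means every listed run has a non-empty .sra file.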

4. Convert SRA to FASTQ using fastq-dump

    4.1 Convert a single file:

In [None]:
/home/elik/sratoolkit.3.0.0-centos_linux64/bin/fastq-dump --split-files --gzip --outdir /PATH/TO/OUTPUT /hdd1/ncbi/sra/SRR123456.sra &

    4.2 Convert multiple files:

In [None]:
tail -n +2 /PATH/TO/SRR/FILE/PRJNA12345.txt | while IFS=$'\t' read -r i _; do /home/elik/sratoolkit.3.0.0-centos_linux64/bin/fastq-dump --split-files --gzip --outdir /PATH/TO/OUTPUT /hdd1/ncbi/sra/"$i".sra; done &
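
    With --split-files, fastq-dump writes each read of a spot to its own file, so a paired-end run normally yields <ID>_1.fastq.gz and <ID>_2.fastq.gz, while a single-end run yields only <ID>_1.fastq.gz. The sketch below uses that naming to report the layout of one run. detect_layout is a hypothetical helper, and runs that also contain technical reads may produce extra files that this check ignores.

```shell
# detect_layout FASTQ_DIR RUN_ID
# Prints "paired" when both <ID>_1.fastq.gz and <ID>_2.fastq.gz exist and
# are non-empty, "single" when only <ID>_1.fastq.gz does, and "missing"
# otherwise.
detect_layout() {
    fastq_dir=$1
    run=$2
    if [ -s "$fastq_dir/${run}_1.fastq.gz" ] && [ -s "$fastq_dir/${run}_2.fastq.gz" ]; then
        echo paired
    elif [ -s "$fastq_dir/${run}_1.fastq.gz" ]; then
        echo single
    else
        echo missing
    fi
}
```

    Example call: detect_layout /PATH/TO/OUTPUT SRR123456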

5. Run Salmon Quantification

    5.1 Change to the directory containing the FASTQ files:

In [None]:
cd /PATH/TO/OUTPUT

    5.2 Quantify reads. The commands below use the human index (hg38, GENCODE v40); for mouse data, change the -i parameter to /hdd1/genomes/salmon/salmon_index/mouse_index_gencodevM31:

    Single-end:

In [None]:
tail -n +2 /PATH/TO/SRR/FILE/PRJNA12345.txt | while IFS=$'\t' read -r i _; do \
/home/elik/anaconda3/bin/salmon quant \
-i /hdd1/genomes/salmon/salmon_index/hg38_index_gencodeV40 \
-l U -r "$i"_1.fastq.gz \
-o /PATH/TO/OUTPUT/"$i"; done &

    Paired-end:

In [None]:
tail -n +2 /PATH/TO/SRR/FILE/PRJNA12345.txt | while IFS=$'\t' read -r i _; do \
/home/elik/anaconda3/bin/salmon quant \
-i /hdd1/genomes/salmon/salmon_index/hg38_index_gencodeV40 \
-l IU -1 "$i"_1.fastq.gz -2 "$i"_2.fastq.gz \
-o /PATH/TO/OUTPUT/"$i"; done &
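
    Each salmon run writes a quant.sf file into its output directory. As a quick sanity check before the analysis step, the sketch below lists the entries with the highest TPM in one such file. top_tpm is a hypothetical helper; it assumes salmon's standard quant.sf column order (Name, Length, EffectiveLength, TPM, NumReads) and a sort that supports general-numeric keys (-g), as GNU sort does.

```shell
# top_tpm QUANT_SF [N]
# Prints the N (default 10) entries with the highest TPM from one salmon
# quant.sf file as "Name<TAB>TPM". The header row is skipped.
top_tpm() {
    quant_sf=$1
    n=${2:-10}
    tab=$(printf '\t')
    # Column 4 is TPM; sort it numerically (g) in descending order (r).
    tail -n +2 "$quant_sf" | LC_ALL=C sort -t "$tab" -k4,4gr | head -n "$n" | cut -f1,4
}
```

    Example call: top_tpm /PATH/TO/OUTPUT/SRR123456/quant.sf 20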

6. Analysis (use the script located at: /hdd1/projects/bulk_expression/code/DEA_R.ipynb):
   
    6.1 Create a summary table of TPM levels (Step 1 of the notebook).

    Output: A table with rows as gene names, columns as sample names, and values as TPM levels.

    6.2 Run differential expression analysis with DESeq2 (Step 2 of the notebook).

    Output: 
    
        a) Differential expression output files between control and each group.

        The relevant columns are log2FoldChange and padj (the FDR-adjusted p-value).
        
        b) PCA plot
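
    To shortlist genes from one differential expression output file, the two columns named above can be filtered from the command line. This is a sketch under stated assumptions, not part of the notebook: filter_de is a hypothetical helper, the file is assumed to be a comma-separated table such as R's write.csv produces (quoted names, a header containing log2FoldChange and padj), and the default cutoffs of padj < 0.05 and |log2FoldChange| >= 1 are illustrative choices only.

```shell
# filter_de RESULTS_CSV [PADJ_MAX] [LFC_MIN]
# Prints the header plus every row with padj < PADJ_MAX and
# |log2FoldChange| >= LFC_MIN. Columns are located by header name, and
# rows with NA in either column are dropped. Quotes are removed on output.
filter_de() {
    results_csv=$1
    padj_max=${2:-0.05}
    lfc_min=${3:-1}
    awk -F',' -v padj_max="$padj_max" -v lfc_min="$lfc_min" '
        { gsub(/"/, "") }                   # drop the quotes write.csv adds
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "log2FoldChange") lfc_idx = i
                if ($i == "padj") padj_idx = i
            }
            if (!lfc_idx || !padj_idx) exit 1
            print
            next
        }
        $lfc_idx == "NA" || $padj_idx == "NA" { next }
        {
            lfc = $lfc_idx + 0
            if (lfc < 0) lfc = -lfc         # absolute fold change
            if ($padj_idx + 0 < padj_max + 0 && lfc >= lfc_min + 0) print
        }
    ' "$results_csv"
}
```

    Example call (hypothetical file name): filter_de control_vs_treated.csv 0.05 1 > control_vs_treated.filtered.csv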


        