# __Title:__ Jupyter notebook for pRESTO

#### __Author:__ Gildas Lepennetier 

<font color='red'> __WARNING__ All of the setup steps below must be completed before starting Jupyter </font>

## Set up the servers

This step should be done even before the jupyter server can be started.

### Create keys to avoid passwords

On the local computer:

`ssh-keygen`

`ssh-copy-id -i ~/.ssh/id_rsa.pub ga94rac2@lxlogin5.lrz.de`

### Connect to the server

#### Example: LRZ linux cluster + SLURM

Replace the user name with your own credentials:

`ssh ga94rac2@lxlogin5.lrz.de`


### Load the required modules to have anaconda (conda) available

`module load python`

`conda info -e`

    root                  *  /lrz/sys/intel/studio2019_u3/intelpython2

This is not enough to actually use pRESTO. One has to create a new environment that uses python 3.6,
then activate it and install the required packages:

`conda create --name pRESTO python=3.6`

`conda activate pRESTO`

`pip install --upgrade pip presto changeo snakemake jupyter`

Check that everything is properly created:

`conda info -e`

    pRESTO                   /home/hpc/tb601/ga94rac2/.conda/envs/pRESTO
    root                  *  /lrz/sys/intel/studio2019_u3/intelpython2
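
If these setup steps are repeated, for example after a cleanup of the home directory, it can help to create the environment only when it is missing. The sketch below is an illustration, not part of the original setup: `has_env` is a hypothetical helper that reads the output of `conda info -e` on standard input and succeeds when the named environment is listed in the first column.

```shell
# has_env NAME: hypothetical helper, succeeds if NAME appears in the first
# column of `conda info -e` output supplied on standard input.
has_env() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Intended use on the server (needs conda, so it is left commented out here):
# conda info -e | has_env pRESTO || conda create --name pRESTO python=3.6
```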


It is good practice to run the `pip install --upgrade` command regularly to stay up to date.

### Start the environment with jupyter + required programs

<font color='red'> WARNING: The modules (conda, jupyter...) have to be activated before you can start the server</font>

Before Jupyter Notebook can be used, the environment has to be activated:

`conda activate pRESTO`

`conda info -e`

    pRESTO                *  /home/hpc/tb601/ga94rac2/.conda/envs/pRESTO
    root                     /lrz/sys/intel/studio2019_u3/intelpython2

To leave the environment afterwards, run `conda deactivate` (older conda versions use `source deactivate`).
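
Since starting the server from the wrong environment is an easy mistake to make, a small guard can be run first. This is only a sketch, assuming a conda version that sets the `CONDA_DEFAULT_ENV` variable when an environment is activated; `require_env` is a hypothetical name.

```shell
# require_env NAME: hypothetical guard, succeeds only when the active conda
# environment (as reported by CONDA_DEFAULT_ENV) is NAME.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]; then
        return 0
    fi
    echo "Environment '$1' is not active; run: conda activate $1" >&2
    return 1
}

# Intended use before starting the server:
# require_env pRESTO && jupyter notebook --no-browser --port=8888
```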


## Use Git versioning

In the following directory

`cd ~/notebooks_jupyter`

### Initialize a git repository, <font color='red'>if not already present.</font>

First, create the repository on GitHub. Then, execute:

`module load git`

`git init`

`git remote add origin https://github.com/GildasLepennetier/notebooks_jupyter.git`

`echo ".ipynb_checkpoints/" > .gitignore`

`git add * .gitignore`

`git commit -am "first commit"`

`git push --set-upstream origin master`
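
If the commands above are rerun in a directory that already has a `.gitignore`, the `echo ... >` step overwrites it. A variant that appends the entry only when it is missing could look like the sketch below; `ignore_once` is a hypothetical helper name, not something provided by git.

```shell
# ignore_once PATTERN [FILE]: hypothetical helper that appends PATTERN to
# FILE (default: .gitignore) unless an identical line is already present.
ignore_once() (
    file="${2:-.gitignore}"
    touch "$file"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -e "$1" "$file" || printf '%s\n' "$1" >> "$file"
)

# Intended use inside the notebook directory:
# ignore_once ".ipynb_checkpoints/"
```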

### Clone the repository, <font color='red'>if it already exists on GitHub.</font>

`git clone https://github.com/GildasLepennetier/notebooks_jupyter.git`

### Update your local copy, <font color='green'>if already present.</font>

`git pull`

## Start the jupyter server


<font color='blue'> optional: change directory, go where the notebooks are stored</font>

`cd ~/notebooks_jupyter`

`jupyter notebook --port-retries=10 --no-browser --port=8888 small_tests.ipynb`

<font color='orange'> WARNING: If the port is already in use, Jupyter will try another one. In that case, the following step has to use the new port.</font>

Start an SSH tunnel (using the port announced by Jupyter) <font color='red'>on the local computer (not the server) </font> so that you can use your own browser

`ssh -v -N -L localhost:8888:localhost:8888 ga94rac2@lxlogin5.lrz.de`
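
Because the port can change, it may be convenient to derive the tunnel command from the URL that Jupyter prints at startup. The following is a minimal sketch: `tunnel_cmd` is a hypothetical helper, and the token `abc` in the example URL is a placeholder.

```shell
# tunnel_cmd URL LOGIN: hypothetical helper that prints the ssh command for
# the port found in the URL Jupyter printed at startup.
tunnel_cmd() (
    port="${1##*:}"      # "http://localhost:8889/?token=abc" -> "8889/?token=abc"
    port="${port%%/*}"   # "8889/?token=abc" -> "8889"
    printf 'ssh -N -L localhost:%s:localhost:%s %s\n' "$port" "$port" "$2"
)

tunnel_cmd "http://localhost:8889/?token=abc" "ga94rac2@lxlogin5.lrz.de"
```

The helper only prints the command, so it can be checked before it is run on the local computer.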

## Check some variables:

In [7]:
%%bash
echo -e "local working directory:\t$(pwd)"
echo -e "host is:\t\t\t$(hostname -i)"


local working directory:	/home/hpc/tb601/ga94rac2/notebooks_jupyter
host is:			10.156.79.105


## Update the environment

It is a good idea to update the programs used on a regular basis

In [8]:
!pip install --upgrade presto changeo | cut -c 1-100

Requirement already up-to-date: presto in /home/hpc/tb601/ga94rac2/.conda/envs/pRESTO/lib/python3.6/
Requirement already up-to-date: changeo in /home/hpc/tb601/ga94rac2/.conda/envs/pRESTO/lib/python3.6


## Get some scripts for the processing

`cd /home/hpc/tb601/ga94rac2/presto_cluster`

`git pull`


## PARAMETERS

It is good practice to keep all parameters for a job in files. A common way to do this is to use a file in one of the following formats:

    text
    json
    python pickle
    R data

All options are retrieved from those files, which helps to keep track of the settings used for each run. It is also possible to set the options directly in this notebook.
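
As a minimal sketch of the plain-text option, parameters can be stored as `NAME=value` lines and loaded by the shell. The file name `params.txt` and its location are assumptions made for illustration; the two values are the ones used later in this notebook.

```shell
# Write a plain-text parameter file (hypothetical name and location),
# with one NAME=value assignment per line.
params_file="${TMPDIR:-/tmp}/params.txt"
cat > "$params_file" <<'EOF'
RUN_ID="190308_M04284_0061_000000000-C9YC4"
WORKING_DIRECTORY_root="/gpfs/scratch/tb601/ga94rac2"
EOF

# Load the parameters into the current shell and use them.
. "$params_file"
echo "run directory: $WORKING_DIRECTORY_root/$RUN_ID"
```

Keeping a copy of such a file next to the results of a run records which settings produced them.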


## Get the data

We have to copy the data to the working directory

### Send data to the server
### Sort the folder

In [20]:
%%bash
LOGIN_DATA_source="ga94rac@141.39.145.123:"
LOGIN_DATA_target="" #ga94rac2@lxlogin5.lrz.de: #no need since already connected there

RUN_ID="190308_M04284_0061_000000000-C9YC4"
echo "$RUN_ID" > $HOME/run.tmp

DATA_SOURCE_root="/media/ga94rac/BACKUP1/"
FASTA_source="$DATA_SOURCE_root/$RUN_ID/Data/Intensities/BaseCalls/"
WORKING_DIRECTORY_root="/gpfs/scratch/tb601/ga94rac2"
FASTA_target=$WORKING_DIRECTORY_root/$RUN_ID/

mkdir -p $FASTA_target

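# The FASTQ files are stored in a fixed location inside the run directory.
# --info=NAME1 prints the name of each transferred file.
# The data server accepts ssh on a port other than 22, so the port is passed through -e.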

rsync --info=NAME1 -e 'ssh -p 2702' $LOGIN_DATA_source$FASTA_source"/*.fastq.gz" $LOGIN_DATA_target$FASTA_target

rsync --info=NAME1 -e 'ssh -p 2702' $LOGIN_DATA_source/$DATA_SOURCE_root/$RUN_ID"/SampleSheet.csv" $LOGIN_DATA_target$FASTA_target

bash $HOME/presto_cluster/SCRIPTS/script_3_bashMe_v1_sort_run_dir.sh $HOME/run.tmp $WORKING_DIRECTORY_root


Separation of samples in directories
Moving into: /gpfs/scratch/tb601/ga94rac2/190308_M04284_0061_000000000-C9YC4
currently doing: SAMPLE_TAG = _S0_L
Undetermined ( *_S0_L* ) -- skipped
currently doing: SAMPLE_TAG = _S1_L
currently doing: SAMPLE_TAG = _S2_L
currently doing: SAMPLE_TAG = _S3_L
currently doing: SAMPLE_TAG = _S4_L
currently doing: SAMPLE_TAG = _S5_L
currently doing: SAMPLE_TAG = _S6_L
currently doing: SAMPLE_TAG = _S7_L
currently doing: SAMPLE_TAG = _S8_L
### list of all directories, save in /gpfs/scratch/tb601/ga94rac2/190308_M04284_0061_000000000-C9YC4/all_samples.txt
samples file = /gpfs/scratch/tb601/ga94rac2/190308_M04284_0061_000000000-C9YC4/all_samples.txt
processed:
BL6-LN_S4
BL6-PEG-SPL_S7
BL6-SPL_S3
BL-LN-PEG_S8
MP3_S1
MP4-PEG_S2
TH-LN_S6
TH-SPL_S5
end: Thu Apr 18 14:19:30 CEST 2019
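
The output above shows the files being separated into one directory per sample. The script that does this is not reproduced in this notebook, so the following is only a sketch of the general idea and not the contents of `script_3_bashMe_v1_sort_run_dir.sh`. It assumes Illumina-style file names of the form `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`; `sample_dir` and `sort_run_dir` are hypothetical names.

```shell
# sample_dir FILE: print the assumed per-sample directory name, "<sample>_S<n>",
# by cutting the file name at the "_L<3 digits>_" lane field.
sample_dir() (
    name="${1##*/}"                                # drop any leading path
    printf '%s\n' "${name%%_L[0-9][0-9][0-9]_*}"
)

# sort_run_dir DIR: move every FASTQ file in DIR into its sample directory.
sort_run_dir() (
    cd "$1" || exit 1
    for f in *.fastq.gz; do
        [ -e "$f" ] || continue                    # no match: the pattern stays literal
        case "$f" in Undetermined_*) continue ;; esac   # unassigned reads are skipped
        d=$(sample_dir "$f")
        mkdir -p "$d" && mv "$f" "$d/"
    done
)

# Intended use, with the variables from the cell above:
# sort_run_dir "$WORKING_DIRECTORY_root/$RUN_ID"
```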
