# 0. Preparation (if needed)
- Install Anaconda: [check here](https://docs.anaconda.com/free/anaconda/install/) (optional)
    - Connect to Jupyter

##### Code block:
```bash
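# Add Anaconda to PATH (the install directory may be anaconda3 or anaconda)
export PATH="$HOME/anaconda3/bin:$PATH"

export PATH="$PATH:$HOME/anaconda/bin"

# Start from the home directory
cd

# Start Jupyter without opening a browser, listening on all interfaces on port 9999
jupyter-notebook --no-browser --ip=0.0.0.0 --port=9999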

```

- Install MAFFT: [check here](https://mafft.cbrc.jp/alignment/software/installation_without_root.html) (required for the alignment step)



**User's Manual**:
- Run the commands in each **Code block** below one step at a time
- Or run the whole pipeline at once:
```bash
cd /nfs/research/goldman/zihao/errorsProject_1/
sh Step_for_MAPLE.sh
```

Tip: to catch errors as soon as they occur, I strongly recommend the first method.

***
***
# 1. Data Preparation

## 1.1 Data slicing (for easy download)

- **Idea**: split each input file into 100 batches so that the batches can be downloaded in parallel
- **Input**:
```python
input_files = ['/nfs/research/goldman/zihao/Datas/p1/File_5_Annot.txt',
               '/nfs/research/goldman/zihao/Datas/p1/File_5_Consensus.txt',
               '/nfs/research/goldman/zihao/Datas/p1/File_5_Coverage.txt']
```
- **Output**:
```python
output_directories = ['/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Datas/batch*.txt',
                      '/nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Datas/batch*.txt',
                      '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Datas/batch*.txt']
```

##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/1_Download/Data_slicing_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/1_Download/Data_slicing.py'
```
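
The contents of `Data_slicing.py` are not shown here, so the following is only a minimal sketch of the idea. It assumes the input is line-oriented, with one sample per line. The function name `split_into_batches` and the `batch<k>.txt` naming are illustrative choices that merely match the `batch*.txt` output pattern above.

```python
from pathlib import Path


def split_into_batches(input_file, output_dir, n_batches=100):
    """Split the lines of input_file into at most n_batches files of near-equal size."""
    lines = Path(input_file).read_text().splitlines(keepends=True)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Ceiling division so that every line lands in some batch.
    size = max(1, -(-len(lines) // n_batches))
    written = []
    for k, start in enumerate(range(0, len(lines), size), start=1):
        batch_path = out / f"batch{k}.txt"
        batch_path.write_text("".join(lines[start:start + size]))
        written.append(batch_path)
    return written
```

If the real input files start with a header line, that line would need to be handled separately (for example, repeated in every batch).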

***
## 1.2 Data Download

- **Idea**: Download the sample files listed in the batches produced in the previous step, following their FTP links


- **Input & Output** (excerpt; the loop body is not shown in full):
```bash
for use_case in "File_5_annot" "File_5_consensus" "File_5_coverage"; do
      input_folder_path="$input_base_path/$use_case/Datas"
      output_folder_path="$output_base_path/$use_case/Downloads"
      # ... download step not shown ...
done
```

##### Code block:
```bash
cd /nfs/research/goldman/zihao/errorsProject_1/1_Download

sh run_data_download.sh
```
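
The download script itself is not reproduced in this document. As a rough sketch of the per-batch work, the function below assumes each batch file holds one FTP URL per line, which may not match the real batch layout. `download_batch` and its `fetch` parameter are hypothetical names; `fetch` defaults to the standard-library `urllib.request.urlretrieve` and can be swapped out for testing.

```python
import urllib.request
from pathlib import Path


def download_batch(batch_file, output_dir, fetch=urllib.request.urlretrieve):
    """Download every URL listed in batch_file (one per line) into output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for line in Path(batch_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        target = out / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # already downloaded, so an interrupted run can resume
        try:
            fetch(url, str(target))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            failed.append((url, str(err)))
    return failed
```

Returning the failed URLs rather than stopping at the first error makes it easier to retry only the missing samples.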

***
***
# 2. Data processing (adjusting sequence format)

1. Put all the downloaded consensus sequences in a single FASTA file (call it all_consensuses.fasta).

2. Run MAFFT with the options shown in section 2.1.2 (--6merpair --keeplength --addfragments), using MN908947.3 as the reference; call the output all_consensuses_aligned.fasta.

3. Remove the reference sequence from the MAFFT alignment output (it should be the first sequence in the file); call the resulting file all_consensuses_aligned_noReference.fasta.

4. Run the script createMapleFile.py on all_consensuses_aligned_noReference.fasta WITHOUT the --reference option. This creates a MAPLE file and a new reference to use with MAPLE.

***
## 2.1. Decompress

- **Idea**: Decompress the downloaded files
- **Input**: All files downloaded in the previous step

- **Output**: 
    - all_consensuses_batch_*.fasta

##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Decompress_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Decompress.py'
```
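
`Decompress.py` is not listed here. The sketch below shows one way the step could work, assuming the downloads are gzip-compressed FASTA files ending in `.gz` (the actual compression format is not stated above). The function name `decompress_to_fasta` is hypothetical.

```python
import gzip
from pathlib import Path


def decompress_to_fasta(download_dir, output_fasta):
    """Concatenate every *.gz file in download_dir into one uncompressed FASTA file."""
    count = 0
    with open(output_fasta, "wb") as out:
        for gz_path in sorted(Path(download_dir).glob("*.gz")):
            with gzip.open(gz_path, "rb") as src:
                data = src.read()
            out.write(data)
            # Make sure the next header starts on its own line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
            count += 1
    return count
```

Each consensus file is read fully into memory, which is reasonable for single viral genomes but would need streaming for much larger inputs.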

### 2.1.2 Sequence Alignment

- **Idea**: Align each batch to the reference MN908947.3 using MAFFT

```bash
#!/bin/bash
# Align each decompressed batch against the MN908947.3 reference.
data_dir="/nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress"
ref="/nfs/research/goldman/zihao/errorsProject_1/Consensuses/ref_MN908947.3.fasta"
for i in {1..20}
do
  # --keeplength keeps the alignment at the reference length
  mafft --6merpair --keeplength --addfragments "$data_dir/all_consensuses_batch_$i.fasta" "$ref" > "$data_dir/aligned_$i.fasta"
done
```

- **Input**: All files decompressed in the previous step

- **Output**: 
    - /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/aligned_*.fasta

##### Code block:
```bash
cd /nfs/research/goldman/zihao/errorsProject_1/Consensuses
sh Aligned.sh
```

***
## 2.2. Delete reference

- **Idea**: Remove the reference sequence MN908947.3 from each aligned file

- **Input**: All files aligned in the previous step

- **Output**: 
    - all_consensuses_aligned_noReference_aligned_*.fasta

##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Del_ref_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Del_ref.py'
```
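
Since the reference should be the first sequence in each MAFFT output file, removing it amounts to skipping the first FASTA record. The sketch below illustrates that idea; it is not the contents of `Del_ref.py`, and the function name `remove_first_record` is hypothetical.

```python
def remove_first_record(in_fasta, out_fasta):
    """Copy in_fasta to out_fasta, skipping the first FASTA record."""
    headers_seen = 0
    kept = 0
    with open(in_fasta) as src, open(out_fasta, "w") as dst:
        for line in src:
            if line.startswith(">"):
                headers_seen += 1
                if headers_seen > 1:
                    kept += 1
            # Lines belonging to the first record (header and sequence) are dropped.
            if headers_seen > 1:
                dst.write(line)
    return kept
```

A more defensive version could check that the first header actually contains MN908947.3 before dropping it.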

***
## 2.3. Merge + Remove blank

- **Input**: all_consensuses_aligned_noReference_aligned_*.fasta

- **Output**: 
    - After merging: all_consensuses_aligned_noReference.fasta
    - After blank removal: rm_blank_all_consensuses_aligned_noReference.fasta

##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Merge_Remove_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Merge_Remove_blank.py'
```
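
What "blank" means in `Merge_Remove_blank.py` is not spelled out above. The sketch below assumes it means blank (empty or whitespace-only) lines and performs the merge and the removal in a single pass, whereas the real script writes two separate files. The function name `merge_fasta` is hypothetical.

```python
def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files into output_path, dropping blank lines."""
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as src:
                for line in src:
                    if not line.strip():
                        continue  # drop empty or whitespace-only lines
                    if line.startswith(">"):
                        n_records += 1
                    # Normalise line endings so records never run together.
                    out.write(line.rstrip("\n") + "\n")
    return n_records
```

The returned record count can be compared with the number of downloaded samples as a quick sanity check.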

***
## 2.4. Convert to MAPLE format

Run the script createMapleFile.py on the merged alignment (the command below uses rm_blank_all_consensuses_aligned_noReference.fasta) WITHOUT the --reference option. This creates a MAPLE file and a new reference to use with MAPLE.

##### Code block:
```bash
bsub "/hps/software/users/goldman/pypy3/pypy3.7-v7.3.5-linux64/bin/pypy3 \
createMapleFile.py \
--path /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/Aligned/noReference/ \
--fasta rm_blank_all_consensuses_aligned_noReference.fasta \
--output MAPLE_format_consensuses.txt"
```

### Explanation of MAPLE format
#### > SRR20944325

| Character | Position | Length |
|-----------|----------|--------|
| - | 1 | 2 |
| n | 3 | 29901 |

- The first column is the character recorded for the sample at that entry (for example n or -).

- The second column is the genome position the entry refers to.

- The third column is optional:
    - If it is absent, the entry refers to a single position.
    - If it is present, its value is the number of consecutive positions the entry covers, starting at the position in the second column.

For example:

| Character | Position | Length |
|-----------|----------|--------|
| n | 541 | 10 |

means that “n” is present in the sequence at ten consecutive positions, from position 541 to position 550.
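
The column rules above can be expressed as a small helper. This is an illustrative sketch only: `expand_entry` is a hypothetical name, and it assumes the fields of an entry are separated by whitespace.

```python
def expand_entry(line):
    """Return (character, first_position, last_position) for one MAPLE entry line."""
    fields = line.split()
    char, start = fields[0], int(fields[1])
    # Without a third column the entry covers exactly one position.
    length = int(fields[2]) if len(fields) > 2 else 1
    return char, start, start + length - 1


print(expand_entry("n 541 10"))  # ('n', 541, 550)
```

Applied to the SRR20944325 example, `- 1 2` covers positions 1 to 2 and `n 3 29901` covers positions 3 to 29903.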

***
***
# 3. Run MAPLE program

##### Code block:
```bash
bsub -g /MapleRealErrors -q long -M 40000 \
  -o /nfs/research/goldman/zihao/errorsProject_1/MAPLE/MAPLE_realData_errorChecking_output.txt \
  -e /nfs/research/goldman/zihao/errorsProject_1/MAPLE/MAPLE_realData_errorChecking_error.txt \
  /hps/software/users/goldman/pypy3/pypy3.7-v7.3.5-linux64/bin/pypy3 \
  /nfs/research/goldman/demaio/fastLK/code/MAPLEv0.3.2.py \
  --model UNREST --rateVariation --estimateSiteSpecificErrorRate \
  --input /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/Aligned/noReference/MAPLE_format_consensuses_new_1.txt \
  --overwrite --estimateErrors --calculateLKfinalTree \
  --output /nfs/research/goldman/zihao/errorsProject_1/MAPLE/MAPLE0.3.1_rateVar_errors_realData_checkingErrors
```