## <center>Documentation: Preparing the First Project and Using MAPLE</center>
##### <center>MANUAL</center>



| **Label** | **Start time** | **Finish time** | **Last modified** |
|:--------------:|:-----------:|:-----------:|:----------------:|
|   Project 1   |  2023-04-11 |  2023-04-20 |   2023-04-27     |

# 0. Possible preparations needed
- Install Anaconda ([installation guide](https://docs.anaconda.com/free/anaconda/install/)) [optional]
    - Connect to Jupyter

##### Code block:
```bash
export PATH="$HOME/anaconda3/bin:$PATH"

export PATH="$PATH:$HOME/anaconda/bin"

cd

jupyter-notebook --no-browser --ip=0.0.0.0 --port=9999

```

- Install MAFFT ([installation without root](https://mafft.cbrc.jp/alignment/software/installation_without_root.html)) [required]

**User's Manual**: Start from 3. [Data Processing](#3.Data_Processing)
- Run the commands in each **Code block** below one at a time, in order.

Tip: Running the steps separately makes it easier to spot errors as soon as they occur, so I strongly recommend this approach.

# <center>Catalog:</center>
- 1. [Data Analysis](#1.Data_Analysis)

&nbsp;

- 2. [Data Preparation](#2.Data_Preparation)
    - [1) Picking_5_files](#2.1.Picking_5_files)
    - [2) Splitting_files_by_type](#2.2.Splitting_files_by_type)

&nbsp;

- 3. [Data Processing](#3.Data_Processing)
    - [1) Data slicing](#3.1.Data_slicing(for_easy_download))
    - [2) Data Download](#3.2.Data_Download)

&nbsp;

- 4. [Adjust sequence format](#4.Adjust_sequence_format)
    - [1) Decompress](#4.1.Decompress)
        - [i) Sequence Alignment](#4.1.1_Sequence_Alignment)
    
    - [2) Delete Reference](#4.2.Delete_Reference)
    - [3) Merge and Remove blank](#4.3.Merge_and_Remove_blank)
    - [4) Transform as the MAPLE format](#4.4.Transform_as_the_MAPLE_format)

&nbsp;

- 5. [Run MAPLE program](#5.Run_MAPLE_program)


***
***
# 1.Data_Analysis

### For this project, we have two datasets:
#### 1. analysis_run_list.txt
- Location:
```python 
/nfs/research/goldman/zihao/Datas/p1/Origin_Data/analysis_run_list.txt
```
- Function: lists the number of files available for each sample ID

#### 2. analysis_full_06.04.23.txt
- Location:
```python 
/nfs/research/goldman/zihao/Datas/p1/Origin_Data/analysis_full_06.04.23.txt
```
- Function: provides the download addresses for the different types of data available for each sample

- Because this is a large dataset, the example shows only the first 100 rows

For our task, we need three types of data:
- coverage: the number of covered segments
- annot: annotations of the sample at different positions, such as raw depth and mutations
- consensus: the sample's consensus sequence

These are only needed for samples that have five files, so we must first select those samples (i.e. keep only the samples with five files).

***
***
# 2.Data_Preparation
[Return to Catalog](#Catalog:)

###### Code:
Data_Preparation_5_files.ipynb

***
### 2.1.Picking_5_files
The processing steps are as follows:
- Read the data file chunk by chunk with Pandas' `read_csv()` function by setting its `chunksize` parameter.
- For each chunk, iterate over the rows and process them one at a time.
- For each row, check whether its "run_id" appears in the other file. If it does, store the row's run_id and FTP_path in a dictionary and write the dictionary contents to a text file.
- If it does not, skip the row.
- Discard the chunk and continue with the next one.
- [!!!] We also found that some files have more than one FTP link pointing to the same file, which would cause duplicate downloads and waste download resources, so we de-duplicate the links.
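
The steps above can be sketched as follows. This is only an illustration, not the code in Data_Preparation_5_files.ipynb: it streams rows with Python's standard `csv` module instead of pandas' chunked `read_csv()`, and the tab delimiter, the function name `pick_rows`, and the de-duplication rule (same run_id plus same file name means same file) are assumptions.

```python
import csv
import io
import posixpath


def pick_rows(table, wanted_run_ids, delimiter="\t"):
    """Yield (run_id, ftp_path) for rows whose run_id is in wanted_run_ids.

    Rows are streamed one at a time, so the full table never sits in memory.
    Assumed de-duplication rule: a link is a duplicate when the same run_id
    already has a link ending in the same file name.
    """
    seen = set()
    for row in csv.DictReader(table, delimiter=delimiter):
        run_id, ftp_path = row["run_id"], row["FTP_path"]
        if run_id not in wanted_run_ids:
            continue  # run_id is not in the other file: skip this row
        key = (run_id, posixpath.basename(ftp_path))
        if key in seen:
            continue  # another link to a file that was already recorded
        seen.add(key)
        yield run_id, ftp_path


# Tiny in-memory example with made-up values
demo = io.StringIO(
    "run_id\tFTP_path\n"
    "RUN_A\tftp://host1/dir/RUN_A_consensus.fasta.gz\n"
    "RUN_A\tftp://host2/dir/RUN_A_consensus.fasta.gz\n"
    "RUN_B\tftp://host1/dir/RUN_B_consensus.fasta.gz\n"
)
picked = list(pick_rows(demo, {"RUN_A"}))
print(picked)
```

Passing an open file handle instead of the `io.StringIO` object would process a real table in the same streaming fashion.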

###### Output's location:
```python 
/nfs/research/goldman/zihao/Datas/p1/Origin_Data/file_5_script.txt # Before de-duplication
/nfs/research/goldman/zihao/Datas/p1/Origin_Data/File_5.txt # After de-duplication
```

***
### 2.2.Splitting_files_by_type
###### Output's location:
```python 
/nfs/research/goldman/zihao/Datas/p1/File_5_Coverage.txt # For Coverage
/nfs/research/goldman/zihao/Datas/p1/File_5_Annot.txt # For Annot
/nfs/research/goldman/zihao/Datas/p1/File_5_Consensus.txt # For Consensus
```

***
***
# 3.Data_Processing
[Return to Catalog](#Catalog:)

***
### 3.1.Data_slicing(for_easy_download)

- **Idea**: split each file into 100 batches so that the batches can be downloaded in parallel
- **Input**:
```python
input_files = ['/nfs/research/goldman/zihao/Datas/p1/File_5_Annot.txt',
               '/nfs/research/goldman/zihao/Datas/p1/File_5_Consensus.txt',
               '/nfs/research/goldman/zihao/Datas/p1/File_5_Coverage.txt']
```
- **Output**:
```python
output_directories = ['/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Datas/batch*.txt',
                      '/nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Datas/batch*.txt',
                      '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Datas/batch*.txt']
```
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/1_Download/Data_slicing.py
```
##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/1_Download/Data_slicing_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/1_Download/Data_slicing.py'
```
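
The slicing idea in this step can be sketched as below. This is a minimal illustration and not the content of Data_slicing.py: the function names, the round-robin distribution, and the output names `batch1.txt` ... `batch100.txt` are assumptions chosen to match the `batch*.txt` pattern.

```python
import tempfile
from pathlib import Path


def split_into_batches(lines, n_batches=100):
    """Distribute lines over n_batches lists in round-robin order."""
    batches = [[] for _ in range(n_batches)]
    for index, line in enumerate(lines):
        batches[index % n_batches].append(line)
    return batches


def write_batches(batches, out_dir):
    """Write each batch to out_dir/batch<N>.txt (file naming is an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for number, batch in enumerate(batches, start=1):
        (out_dir / f"batch{number}.txt").write_text("".join(batch))


# Example: 250 made-up links split into 100 batches of 2 or 3 lines each
lines = [f"link{i}\n" for i in range(250)]
batches = split_into_batches(lines, 100)
with tempfile.TemporaryDirectory() as tmp:
    write_batches(batches, tmp)
    n_written = len(list(Path(tmp).glob("batch*.txt")))
print(n_written, len(batches[0]), len(batches[99]))
```

Round-robin assignment keeps the batch sizes within one line of each other, so parallel download jobs finish at roughly the same time.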

***
### 3.2.Data_Download

- **Idea**: Download the sample files by following the FTP links listed in the batch files created in the previous step


- **Input&Output**:
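```bash
# Excerpt from the download script; input_base_path and output_base_path are defined elsewhere
for use_case in "File_5_annot" "File_5_consensus" "File_5_coverage"; do
      input_folder_path="$input_base_path/$use_case/Datas"
      output_folder_path="$output_base_path/$use_case/Downloads"
      # ... (download step omitted in this excerpt)
done
```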
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/1_Download/run_data_download.sh
/nfs/research/goldman/zihao/errorsProject_1/1_Download/D_file_5_script.py
```
##### Code block:
```bash
cd /nfs/research/goldman/zihao/errorsProject_1/1_Download

sh run_data_download.sh
```
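
A download step of this kind can be sketched as follows. This is a hedged illustration, not the content of D_file_5_script.py: the function names are hypothetical, and it assumes each line of a batch file is a full URL including its scheme (for example `ftp://...`).

```python
import posixpath
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def plan_downloads(links, out_dir):
    """Pair each link with a local destination named after the remote file."""
    out_dir = Path(out_dir)
    plan = []
    for link in links:
        link = link.strip()
        if not link:
            continue  # ignore empty lines in the batch file
        name = posixpath.basename(urlparse(link).path)
        plan.append((link, out_dir / name))
    return plan


def download_all(plan):
    """Fetch every planned file, skipping files that already exist locally."""
    for link, dest in plan:
        if dest.exists():
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(link, dest)


# Example with a made-up link; download_all(plan) would perform the transfer
plan = plan_downloads(["ftp://example.org/dir/sample1.fasta.gz\n", "\n"], "Downloads")
print(plan)
```

Separating planning from fetching makes it possible to inspect the destinations, or to rerun an interrupted batch, without downloading anything twice.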

***
***
# 4.Adjust_sequence_format
[Return to Catalog](#Catalog:)

1. Put all the downloaded consensus sequences in a single FASTA file (let's call it `all_consensuses.fasta`).

2. Run MAFFT with the special options mentioned before, including `--keeplength`, using the provided reference MN908947.3; let's call the output `all_consensuses_aligned.fasta`.

3. Remove the reference sequence from the MAFFT alignment output (it should be the first sequence in the file); let's call the resulting file `all_consensuses_aligned_noReference.fasta`.

4. Run the script `createMapleFile.py` on `all_consensuses_aligned_noReference.fasta` WITHOUT using the `--reference` option. This creates a MAPLE file and a new reference to use with MAPLE.

***
## 4.1.Decompress

- **Idea**: Decompress the downloaded files
- **Input**: All files downloaded in the previous step

- **Output**: 
    - all_consensuses_batch_*.fasta
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Consensuses/Decompress.py
```
##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Decompress_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Decompress.py'
```
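
The decompression step can be sketched as below. This is not the content of Decompress.py: it assumes the downloaded files are gzip-compressed (`.gz`) and that each one ends with a newline, and the function name `decompress_into` is hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_into(gz_paths, out_path):
    """Write the decompressed content of each .gz file, in order, to out_path."""
    with open(out_path, "wb") as out:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as src:
                shutil.copyfileobj(src, out)


# Example with two tiny made-up compressed FASTA files
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name, text in [("s1.fasta.gz", ">s1\nACGT\n"), ("s2.fasta.gz", ">s2\nTTGA\n")]:
        with gzip.open(tmp / name, "wt") as handle:
            handle.write(text)
    merged_path = tmp / "all_consensuses_batch_1.fasta"
    decompress_into(sorted(tmp.glob("*.gz")), merged_path)
    merged = merged_path.read_text()
print(merged)
```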

### 4.1.1_Sequence_Alignment

- **Idea**: Align the sequences with MAFFT, using MN908947.3 as the reference

```bash
#!/bin/bash
# Align each batch of consensus sequences to the MN908947.3 reference with MAFFT
data_dir="/nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress"
ref="/nfs/research/goldman/zihao/errorsProject_1/Consensuses/ref_MN908947.3.fasta"
for i in {1..20}
do
  mafft --6merpair --keeplength --addfragments "$data_dir/all_consensuses_batch_$i.fasta" "$ref" > "$data_dir/aligned_$i.fasta"
done
```

- **Input**: All files decompressed in the previous step

- **Output**: 
    - /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/aligned/aligned_*.fasta
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Consensuses/Aligned.sh
``` 
##### Code block:
```bash
cd /nfs/research/goldman/zihao/errorsProject_1/Consensuses
bash Aligned.sh
```

***
## 4.2.Delete_Reference

- **Idea**: Remove the reference sequence MN908947.3 from each alignment

- **Input**: All files aligned in the previous step

- **Output**: 
    - all_consensuses_aligned_noReference_aligned_*.fasta
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Consensuses/Del_ref.py
```
##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Del_ref_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Del_ref.py'
```
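
Removing the reference from an aligned FASTA file can be sketched as below. This is an illustration rather than the content of Del_ref.py; the function name `drop_record` is hypothetical, and it assumes the reference header starts with `>MN908947.3`.

```python
import io


def drop_record(fasta_in, fasta_out, record_id):
    """Copy a FASTA stream, omitting the record whose header ID equals record_id.

    The header ID is taken to be the first whitespace-separated token after '>'.
    """
    skipping = False
    for line in fasta_in:
        if line.startswith(">"):
            fields = line[1:].split()
            header_id = fields[0] if fields else ""
            skipping = header_id == record_id
        if not skipping:
            fasta_out.write(line)


# Example with made-up sequences: the reference comes first, as in the MAFFT output
aligned = io.StringIO(">MN908947.3 reference\nACGT\nAC\n>s1\nACGA\n>s2\nAC-A\n")
no_reference = io.StringIO()
drop_record(aligned, no_reference, "MN908947.3")
print(no_reference.getvalue())
```

Matching on the header ID rather than on position means the reference is removed even if it is not the first record in a file.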

***
## 4.3.Merge_and_Remove_blank

- **Idea**: Merge the per-batch files into a single file, then remove blanks

- **Input**: all_consensuses_aligned_noReference_aligned_*.fasta

- **Output**: 
    - After merging: all_consensuses_aligned_noReference.fasta

    - After removing blanks: rm_blank_all_consensuses_aligned_noReference.fasta
- **Location**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Consensuses/Merge_Remove_blank.py
```
##### Code block:
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Merge_Remove_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Consensuses/Merge_Remove_blank.py'
```
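
A merge-and-clean step of this kind can be sketched as below. This is not the content of Merge_Remove_blank.py: it assumes "blank" means empty or whitespace-only lines, and the function name is hypothetical.

```python
import io


def merge_without_blank_lines(streams, out):
    """Concatenate text streams into out, dropping empty or whitespace-only lines."""
    for stream in streams:
        for line in stream:
            if line.strip():
                # Make sure every kept line ends with a newline so records never fuse
                out.write(line if line.endswith("\n") else line + "\n")


# Example with two made-up per-batch files
part_1 = io.StringIO(">s1\nACGT\n\n")
part_2 = io.StringIO("\n>s2\nTT")
merged_clean = io.StringIO()
merge_without_blank_lines([part_1, part_2], merged_clean)
print(merged_clean.getvalue())
```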

***
## 4.4.Transform_as_the_MAPLE_format

Run the script `createMapleFile.py` on `rm_blank_all_consensuses_aligned_noReference.fasta` (the merged file with blanks removed) WITHOUT using the `--reference` option. This creates a MAPLE file and a new reference to use with MAPLE.

##### Code block:
```bash
bsub "/hps/software/users/goldman/pypy3/pypy3.7-v7.3.5-linux64/bin/pypy3 \
createMapleFile.py \
--path /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/Aligned/noReference/ \
--fasta rm_blank_all_consensuses_aligned_noReference.fasta \
--output MAPLE_format_consensuses.txt"
```

### Explanation of MAPLE format
#### > SRR20944325

| Character | Position | Length |
|---------|---------|---------|
| - | 1 | 2 |
| n | 3 | 29901 |

- The first column gives the character present in the sample (for example `-` or `n`).

- The second column gives the genome position the entry refers to.

- The third column
    - If the third column is not present, the entry refers to only one position.
    - If the third column is present, its value gives the number of positions the entry refers to.

- For example, the entry

| Character | Position | Length |
|---------|---------|---------|
| n | 541 | 10 |

means that “n” is present in the sequence in ten consecutive positions, from position 541 up to position 550.
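
The column rules described above can be expressed as a short parsing sketch. The function name `parse_entry` is hypothetical, and the code assumes the fields of an entry are separated by whitespace.

```python
def parse_entry(line):
    """Return (character, start, end) for one MAPLE entry, 1-based and inclusive.

    A missing third column means the entry covers a single position.
    """
    fields = line.split()
    character = fields[0]
    start = int(fields[1])
    length = int(fields[2]) if len(fields) > 2 else 1
    return character, start, start + length - 1


# The example from the text: "n" at ten consecutive positions starting at 541
span = parse_entry("n\t541\t10")
print(span)
```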

***
***
# 5.Run_MAPLE_program
[Return to Catalog](#Catalog:)

##### Code block:
```bash
bsub -g /MapleRealErrors_3 -q long -M 40000 \
  -o /nfs/research/goldman/zihao/errorsProject_1/MAPLE/new_version_MAY/MAPLE_realData_errorChecking_output_new_all.txt \
  -e /nfs/research/goldman/zihao/errorsProject_1/MAPLE/new_version_MAY/MAPLE_realData_errorChecking_error_new_all.txt \
  /hps/software/users/goldman/pypy3/pypy3.7-v7.3.5-linux64/bin/pypy3 /nfs/research/goldman/demaio/fastLK/code/MAPLEv0.3.2.py \
  --model UNREST --rateVariation --estimateSiteSpecificErrorRate \
  --input /nfs/research/goldman/zihao/Datas/p1/File_5_consensus/Decompress/Aligned/noReference/A_NEW_MAPLE_format/MAPLE_format_consensuses_new_1.txt \
  --overwrite --estimateErrors --calculateLKfinalTree \
  --output /nfs/research/goldman/zihao/errorsProject_1/MAPLE/new_version_MAY/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_new_all
```