## <center>Data Processing and Visualization for Coverage & Annot</center>
##### <center>MANUAL</center>



| **Label** | **start time** | **finish time** | **last modified** |
|:--------------:|:-----------:|:-----------:|:----------------:|
|   Project 1   |  2023-04-20 |  2023-04-24 |   2023-05-16     |

# <center>Catalog:</center>
- 1. [Process output](#1.Process_MAPLE_output)

&nbsp;

- 2. [Data processing for Coverage](#2.Data_processing_for_Converage)
    - [1) Decompress and save coverage](#1.Decompress_and_save_coverage)
    - [2.1) Processing for error positions](#2.1_Processing_for_error_pos_of_Cov)
    - [2.2) Processing for all positions](#2.2_Processing_for_all_pos_of_Cov)
        - [2.2.1) Sampling for all positions](#2.2.1_Sampling_of_all_positions_cov)

&nbsp;

- 3. [Data processing for Annot](#3.Data_processing_for_Annot)
    - [1) Decompress and save Annot](#1.Decompress_and_save_Annot)
    - [2.1) Processing for error positions](#2.1_Processing_for_error_pos_of_Annot)
    - [2.2) Processing for all positions](#2.2_Processing_for_all_pos_of_Annot)
        - [2.2.1) Sampling for all positions](#2.2.1_Sampling_of_all_positions)

&nbsp;

- 4. [Visualization](#4.Visualization)
    - [1) Density Chart](#1.Density_chart_(python))
    - [2) Venn Chart](#2.Venn_chart_(R))

---
# 1.Process_MAPLE_output
[Return to Catalog](#Catalog:)

- **Idea**: Reformat the MAPLE output into a consistent table and remove entries with an error rate below 0.5.
- **Input**: MapleRealErrorsVariation_errorEstimation_estimatedErrors.txt
- **Output**: output_modified.txt
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Handle_with_output.py
```
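The filtering idea above can be sketched as follows. This is only a minimal illustration, not the contents of Handle_with_output.py: it assumes the estimated-errors file consists of `>ID` header lines followed by tab-separated entries whose last field is the error probability. The function name `collate_and_filter` and that column layout are assumptions and may not match the real file.

```python
def collate_and_filter(lines, min_prob=0.5):
    """Flatten '>ID' blocks into [ID, field, ...] rows and keep only rows
    whose last field (ASSUMED to be the error probability) is >= min_prob."""
    rows = []
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new sample block starts; remember its ID without the '>'.
            current_id = line[1:]
            continue
        fields = line.split("\t")
        try:
            prob = float(fields[-1])
        except ValueError:
            continue  # skip lines whose last field is not numeric
        if prob >= min_prob:
            rows.append([current_id] + fields)
    return rows
```

Each returned row starts with the sample ID, so the result can be written out directly as one tab-separated table.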

### Code block:

###### 1) Modify input path
```bash
sed -i 's|file_path = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_50000_estimatedErrors.txt"|file_path = "/new_path/to/estimatedErrors.txt"|g' /nfs/research/goldman/zihao/errorsProject_1/Handle_with_output.py
```
###### 2) Modify output path
```bash
sed -i 's|output_folder = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/"|output_folder = "/new_path/to/output_folder/"|g' /nfs/research/goldman/zihao/errorsProject_1/Handle_with_output.py
```
###### 3) Run the program
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Handle_with_output_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Handle_with_output.py'
```

# 2.Data_processing_for_Converage
[Return to Catalog](#Catalog:)

***
### 1.Decompress_and_save_coverage

- **Idea**: 
    - I defined a class called CoverageProcessor that takes two parameters: the input directory and the output directory.

    - The class has three methods: get_file_paths, process_file_2 and process_file. get_file_paths scans the input directory for all files ending in .coverage.gz and returns their paths. process_file_2 iterates over these files and calls process_file on each one.

    - The process_file method reads a given .coverage.gz file, calculates the mean coverage and the coverage ratio for each position (**RATIO** = NB_COVERAGE / MEAN_nb_coverage), and writes the result to a text file.

    - If the output file already exists, process_file_2 skips that file and moves on to the next one.

    - If any exception occurs while processing a file, that file is skipped and the program continues with the next one. The program also prints an error message so that the cause can be identified.
- **Input**: Downloaded files
- **Output**:
```python
output_directories = ['/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Decompress/*_coverage.txt']
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Decompress_and_save.py
```
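The ratio calculation described in the Idea can be sketched as below. This is a hedged illustration rather than the code in Coverage_Decompress_and_save.py: the helper names `coverage_ratios` and `read_coverage_gz` are hypothetical, and the reader assumes one tab-separated `position<TAB>coverage` pair per line with no header, which may differ from the real .coverage.gz layout.

```python
import gzip


def coverage_ratios(coverages):
    """Return (mean, ratios), where ratios[i] = coverages[i] / mean."""
    if not coverages:
        return 0.0, []
    mean = sum(coverages) / len(coverages)
    if mean == 0:
        # Avoid division by zero when every position has zero coverage.
        return 0.0, [0.0] * len(coverages)
    return mean, [c / mean for c in coverages]


def read_coverage_gz(path):
    """Read per-position coverage from a gzipped text file.

    ASSUMPTION: each line is 'position<TAB>coverage' and there is no header.
    """
    positions, coverages = [], []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            pos, cov = line.rstrip("\n").split("\t")[:2]
            positions.append(int(pos))
            coverages.append(float(cov))
    return positions, coverages
```

A ratio above 1 means the position is covered more deeply than the sample average, and a ratio below 1 means it is covered less deeply.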

##### Code block:
Tip: All samples have already been processed, so this step does not need to be rerun.
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Decompress_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Decompress_and_save.py'
```

***
### 2.1_Processing_for_error_pos_of_Cov

- **Idea**: 
    - I defined a class named DataProcessor with three properties: data_file_path is the path to the data file, directory_to_search is the directory to search, and output_file_path is the path to the output file.

    - `search_position_value_and_get_column_4` is a static method that looks up a given position in a specified file and returns the value in the fourth column. It opens the file, reads the header, finds the index of the "Position" column, and then iterates through each row. If it finds a row that matches the given position, it returns the value of the fourth column of that row. If no matching row is found, it returns `None`.

    - `gather_error_by_position` is another static method that collects error information for each position from the given data. It iterates through each row of the data, extracts the position and error value, and adds them to the error_by_position list, but only if the error value is not equal to "None". Finally, it returns the list of error entries.

    - The `process_data` method contains the main data processing logic. It opens the data file, reads the header, and takes the first 10 lines of data in the file. It then iterates through these rows, extracts the ID and position information, and adds them to the data list. Next, for each row, it searches the specified directory for a file that matches the ID and position information. If a matching file is found, search_position_value_and_get_column_4 is called to get the value of the fourth column, which is stored in the last column of the data list.

    - After the data is processed, the code removes the rows in the data list that contain "None" or None. It then calls gather_error_by_position to collect the error information for each position and writes the results to the output file.
 
- **Input**: The file with the position of the error after processing in the [previous step](#1.Process_MAPLE_output)
- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Error_pos_for_coverage.txt"]
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_error_pos.py
```
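The position lookup described in the Idea can be sketched as below. It is a minimal illustration, not the code in Coverage_Treat_error_pos.py: the function name is taken from the description, but taking an iterable of lines (instead of a file path) and assuming a tab-separated layout are choices made here for the example.

```python
import csv


def search_position_value_and_get_column_4(lines, position):
    """Return the fourth-column value of the row whose 'Position' matches.

    `lines` is any iterable of tab-separated text lines whose first line is
    a header containing a 'Position' column (ASSUMED layout).
    Returns None when the header or a matching row is missing.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    if header is None or "Position" not in header:
        return None
    pos_idx = header.index("Position")
    for row in reader:
        # Skip short rows so that row[3] is always safe to read.
        if len(row) > max(pos_idx, 3) and row[pos_idx] == str(position):
            return row[3]
    return None
```

With a real file, an open file handle can be passed directly, for example `with open(path, newline="") as fh: search_position_value_and_get_column_4(fh, 101)`.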

##### Code block:
###### 1) Modify input path ([previous step's](#1.Process_MAPLE_output)  output)
```bash
sed -i 's|data_file_path = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/output_modified.txt"|data_file_path = "/new_path/to/output_modified.txt"|g' /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_error_pos.py
```
###### 2) Run the program
```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Treat_error_pos_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_error_pos.py'
```

***
### 2.2_Processing_for_all_pos_of_Cov

- **Idea**: 
   - In this step I define two functions.

   - Firstly, the check_files_with_id function checks the files in the given folder whose filenames contain the IDs specified in the checkid_file. It copies the matching files to the output_folder. It starts by creating an empty set called id_set to store the IDs from the checkid_file. Then, it opens the checkid_file, reads its contents line by line, and adds the IDs (excluding the '>' character) to the id_set. Next, it checks if the output_folder exists and creates it if it doesn't. After that, it iterates over the filenames in the folder_path. If any ID string from id_set is found in the filename, it copies that file from the folder_path to the output_folder.

   - Secondly, the process_files function integrates and merges all the data in the input_folder into a single text file. It creates a new file named "Cov_RATIO.txt" in the output_folder. Then, it uses a CSV writer object with a tab delimiter to write the column names ("ID", "Position", "Ratio") to the output file. After that, it loops through the file names in the input_folder. For each file with the ".txt" extension, it extracts the file ID from the filename and opens the file. It skips the header line and reads the Position and Ratio columns. Then, it iterates over each line of the file, extracts the Position and Ratio values, and writes them along with the file ID to the output file.

- **Input**: MapleRealErrorsVariation_errorEstimation_estimatedErrors.txt
- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Cov_RATIO.txt"]
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_all_pos.py
```
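The two functions described in the Idea can be sketched roughly as below. This is an illustration under assumptions, not the code in Coverage_Treat_all_pos.py: the helper names `load_ids`, `files_matching_ids` and `merge_ratio_files` are hypothetical, only lines starting with `>` are treated as IDs, and the file ID is assumed to be the part of the filename before the first underscore.

```python
import csv
import os


def load_ids(lines):
    """Collect IDs from lines that start with '>' (the '>' itself is dropped)."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line.startswith(">") and len(line) > 1:
            ids.add(line[1:])
    return ids


def files_matching_ids(filenames, id_set):
    """Keep filenames that contain at least one ID as a substring."""
    return [name for name in filenames if any(i in name for i in id_set)]


def merge_ratio_files(input_folder, output_path):
    """Merge the Position and Ratio columns of every .txt file into one table.

    ASSUMPTIONS: files are named '<ID>_coverage.txt', are tab-separated, and
    have a header with 'Position' and 'Ratio'. Keep output_path outside
    input_folder so the output is not read back as an input.
    """
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["ID", "Position", "Ratio"])
        for name in sorted(os.listdir(input_folder)):
            if not name.endswith(".txt"):
                continue
            file_id = name.split("_")[0]
            with open(os.path.join(input_folder, name), newline="") as handle:
                for row in csv.DictReader(handle, delimiter="\t"):
                    writer.writerow([file_id, row["Position"], row["Ratio"]])
```

Rows are streamed from each input file straight to the output, so memory use stays small even when the merged table is large.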

##### Code block:
###### 1) Modify the path of the ID check file, which is used to exclude IDs that do not match the coverage data
```bash
sed -i 's|checkid_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_50000_estimatedErrors.txt"|checkid_file = "/new/path/to/estimatedErrors.txt"|g' /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_all_pos.py
```
###### 2) Modify the path of the folder where all files corresponding to the current ID are stored
```bash
sed -i 's|middle_output_folder = "/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/PLOT_FOR_Coverage/"|middle_output_folder = "/new/path/to/new_folder"|g' /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_all_pos.py
```

###### 3) Run the program

```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Treat_all_pos_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_all_pos.py'
```

#### 2.2.1_Sampling_of_all_positions_cov


- **Idea**: 
   We need to randomly draw five samples per position from all available IDs. The data is too large to load into memory at once, so we use Reservoir Sampling, an algorithm for randomly sampling a large data stream with limited memory.

- **Input**: Full position after processing ([previous step's](#2.2_Processing_for_all_pos_of_Cov)  output)
- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/cov_sampling_all.txt"]
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_sampling_all_pos.py
```
#### Reservoir Sampling Algorithm
Reservoir Sampling is a randomized algorithm used to select a random sample of `k` items from a stream or a large dataset of unknown size. The algorithm ensures that each item in the stream has an equal probability of being selected for the sample, regardless of the size of the dataset. Reservoir Sampling is particularly useful when the dataset is too large to fit into memory or when the size is unknown in advance.

The algorithm works as follows:

1. Initialize an empty reservoir of size `k` to store the sampled items.
2. Read the stream or dataset one item at a time.
3. For the first `k` items, simply add them to the reservoir.
4. For the `i`th item (where `i > k`), generate a random number `j` between `1` and `i` (inclusive).
   - If `j <= k`, replace the `j`th item in the reservoir with the `i`th item.
   - Otherwise, ignore the `i`th item and continue to the next item.
5. Repeat step 4 until all items in the stream or dataset have been processed.
6. The final reservoir contains a random sample of `k` items from the stream or dataset.

As a result, every item ends up in the final reservoir with the same probability, `k / n`, where `n` is the total number of items processed.
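The numbered steps can be sketched in Python as follows. This is a minimal illustration and not necessarily how Coverage_sampling_all_pos.py is written: the function names, the `rng` parameter, and the assumption that each row is an `(ID, Position, Ratio)` tuple are hypothetical. `sample_per_position` keeps one reservoir per position, which matches the goal of drawing five samples per position.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from an iterable of unknown size."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # the first k items fill the reservoir
        else:
            j = rng.randint(1, i)       # inclusive on both ends
            if j <= k:
                reservoir[j - 1] = item  # replace the j-th item
    return reservoir


def sample_per_position(rows, k=5, rng=None):
    """Keep up to k randomly chosen rows for each position (row[1])."""
    rng = rng or random.Random()
    seen = {}        # position -> number of rows seen so far
    reservoirs = {}  # position -> up to k sampled rows
    for row in rows:
        pos = row[1]
        seen[pos] = seen.get(pos, 0) + 1
        bucket = reservoirs.setdefault(pos, [])
        if len(bucket) < k:
            bucket.append(row)
        else:
            j = rng.randint(1, seen[pos])
            if j <= k:
                bucket[j - 1] = row
    return reservoirs
```

Only `k` rows per position are held in memory at any time, so the full table can be streamed line by line from disk.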

##### Code block:
```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_sampling_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_sampling_all_pos.py'
```

***
***
# 3.Data_processing_for_Annot
[Return to Catalog](#Catalog:)

***
### 1.Decompress_and_save_Annot

- **Idea**: 
    - To process VCF files, I created a Python class called VCFProcessor. It has two attributes, input_dir and output_dir, which hold the input and output directories.

        - The static method parse_info_field parses the INFO field of a VCF record and converts it to a Python dictionary.

        - The process_vcf_file method processes a single VCF file and returns the processed data. It reads and parses the file with Python's csv module, then uses parse_info_field to parse the INFO field and retrieve the essential information, such as chromosome position, reference and substitution bases, and mutation frequency. It then filters out INDELs, converts the mutation frequency to a decimal number, and, if the frequency is larger than 0.5, replaces it with 1 minus its value. Finally, it returns the processed data as a dictionary.

        - The process_files method processes all of the VCF files in the input directory. It uses the glob module to retrieve all file paths that match the criteria and then calls process_vcf_file on each file. If processing succeeds, the result is saved to the output directory as a TXT file.

    - In the main function, I define a variable length_of_sample that holds the length of the sample. I then define the paths to the input and output directories, construct a VCFProcessor object, and call process_files to go through all of the VCF files in the input directory.





- **Input**: Downloaded files
- **Output**:
```python
output_directories = ['/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Decompress/*_annot.txt']
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Decompress_and_save.py
```
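The INFO parsing and frequency folding described in the Idea can be sketched as below. This is a hedged example, not the code in Annot_Decompress_and_save.py: `parse_info_field` is named after the description, while `fold_frequency`, `process_record` and the output keys are hypothetical. It also assumes the frequency is stored under an `AF` key with a single value and that indels are marked with an `INDEL` flag.

```python
def parse_info_field(info):
    """Convert a VCF INFO string such as 'DP=100;AF=0.75;INDEL' to a dict."""
    result = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Flags such as INDEL carry no value, so store True for them.
        result[key] = value if sep else True
    return result


def fold_frequency(af):
    """Replace a frequency above 0.5 with 1 minus its value."""
    return 1.0 - af if af > 0.5 else af


def process_record(pos, ref, alt, info):
    """Return one processed record, or None for indels and records without AF."""
    fields = parse_info_field(info)
    if "INDEL" in fields or "AF" not in fields:
        return None
    return {
        "Position": int(pos),
        "REF": ref,
        "ALT": alt,
        "AF": fold_frequency(float(fields["AF"])),
    }
```

Folding keeps every frequency in the range 0 to 0.5, so a value of 0.75 and a value of 0.25 are treated as the same minor-allele frequency.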

##### Code block:
Tip: All samples have already been processed, so this step does not need to be rerun.
```bash
bsub -M 2000 \
  -e /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_decompress_errorChecking_error.txt \
  'python3 /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Decompress_and_save.py'
```

***
### 2.1_Processing_for_error_pos_of_Annot

- **Idea**: 
    - I wrote a data processing class, DataProcessor, which takes three input parameters: data1_file, data_folder and output_file. data1_file is the file containing the data to be processed, data_folder is the folder containing the other files to be looked up, and output_file is the path of the file to which the processed data is written.

        - The DataProcessor class defines three functions. read_data1 reads data1_file and returns the header and the data rows. process_data processes each line of data1 and adds two columns, AF and SB, by looking up the corresponding file in data_folder. If the corresponding file is not found, the two columns are left blank. Finally, write_output writes the processed data to output_file.

        - In the main function, I specify the three input parameters and create an instance of the DataProcessor class. I then call write_output to process the data and write the result.

    - The purpose of this class is to match the data in data1_file against the files in data_folder and write the processed data to output_file.
- **Input**: The file with the position of the error after processing in the [previous step](#1.Process_MAPLE_output)
- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Error_pos_for_annot.txt"]
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_error_pos.py
```
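The column-matching step described in the Idea can be sketched as below. This is a simplified illustration, not the code in Annot_Treat_error_pos.py: the function `add_af_sb` is hypothetical, and the per-sample files in data_folder are stood in for by an in-memory dictionary of the assumed form `{ID: {position: (AF, SB)}}`.

```python
def add_af_sb(rows, annotations):
    """Append AF and SB to each row, leaving both blank when nothing matches.

    rows:        list of rows whose first two fields are ID and Position.
    annotations: {ID: {position: (AF, SB)}}, a stand-in (ASSUMPTION) for
                 the per-sample annotation files looked up in data_folder.
    """
    out = []
    for row in rows:
        sample_id, position = row[0], row[1]
        af, sb = annotations.get(sample_id, {}).get(position, ("", ""))
        out.append(list(row) + [af, sb])
    return out
```

Rows without a matching sample or position keep their original fields and receive two empty strings, which mirrors the "left blank" behaviour described above.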

##### Code block:
###### 1) Modify input path ([previous step's](#1.Process_MAPLE_output)  output)
```bash
sed -i 's|data1_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/output_modified.txt"|data1_file = "/new_path/to/output_modified.txt"|g' /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_error_pos.py
```
###### 2) Run the program
```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Annot/Treat_error_pos_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_error_pos.py'
```

***
### 2.2_Processing_for_all_pos_of_Annot

- **Idea**: 
   - I have created two functions to process files. The first function, check_files_with_id, checks if the filenames in a given folder contain specific IDs listed in a file, and copies those files to an output folder. The second function, process_files, integrates and merges data from multiple files into a single output file for further sampling analysis.

- **Input**: MapleRealErrorsVariation_errorEstimation_estimatedErrors.txt
- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Annot_RATIO.txt"]
```
- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_all_pos.py
```

##### Code block:
###### 1) Modify the path of the ID check file, which is used to exclude IDs that do not match the annotation data
```bash
sed -i 's|checkid_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_50000_estimatedErrors.txt"|checkid_file = "/new/path/to/estimatedErrors.txt"|g' /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_all_pos.py
```
###### 2) Modify the path of the folder where all files corresponding to the current ID are stored
```bash
sed -i 's|middle_output_folder = "/nfs/research/goldman/zihao/Datas/p1/File_5_annot/PLOT_FOR_Annot/"|middle_output_folder = "/new/path/to/new_folder"|g' /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_all_pos.py
```

###### 3) Run the program

```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Annot/Treat_all_pos_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_Treat_all_pos.py'
```

***

#### 2.2.1_Sampling_of_all_positions

- **Idea**: 
    - In this code, I have defined a function called merge_files that reads two input files and writes specific rows from the second file into an output file based on matching ID and Position from the first file.

    - The function starts by initializing a set called id_position_set to store unique combinations of ID and Position. Then, it reads the data file using the pandas library and iterates over each row. For each row, it combines the ID and Position values into a string and adds it to the id_position_set.

    - Next, the function reads the input file line by line. It skips the title line using the next function. Then, it opens the output file for writing and writes the title line to it. For each subsequent line in the input file, it splits the line into fields using the tab separator. It extracts the ID and Position values from the fields and combines them into a string. If this string is found in the id_position_set, it writes the entire line to the output file.

- **Input**: 
```python
data_file = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/selected_data.txt'
input_file = '/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Annot_RATIO.txt'
```

- **Output**:
```python
output_directories = ["/nfs/research/goldman/zihao/Datas/p1/File_5_annot/Annot_RATIO_sampling.txt"]
```

- **Address**:
```python
/nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_sampling_all_pos.py
```
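The matching logic of merge_files can be sketched as below. This is a minimal illustration that uses plain Python sets and line iteration instead of pandas, so it is not the code in Annot_sampling_all_pos.py. The helper names `build_key_set` and `filter_lines` are hypothetical, and the sketch assumes both files are tab-separated, that the data file has `ID` and `Position` header columns, and that the input file stores ID and Position in its first two columns.

```python
def build_key_set(data_lines):
    """Collect (ID, Position) pairs from the sampled data file."""
    lines = iter(data_lines)
    header = next(lines, "").rstrip("\n").split("\t")
    id_idx, pos_idx = header.index("ID"), header.index("Position")
    keys = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[id_idx], fields[pos_idx]))
    return keys


def filter_lines(input_lines, keys):
    """Yield the header, then every line whose (ID, Position) is in keys.

    ASSUMPTION: ID and Position are the first two columns of the input.
    """
    lines = iter(input_lines)
    header = next(lines, None)
    if header is None:
        return
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and (fields[0], fields[1]) in keys:
            yield line
```

Because `filter_lines` is a generator, the large input file can be passed as an open file handle and written out line by line, for example with `out.writelines(filter_lines(fh, keys))`.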

##### Code block:
```bash
bsub -M 20000 -e /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_sampling_all_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Annot/Annot_sampling_all_pos.py'
```

***
***
# 4.Visualization
[Return to Catalog](#Catalog:)

### 1.Density_chart_(python)

```python
for_annot_AF = ["./EBI_INTER/Project_1/Step_1.2_Annot/3_Final_plot_Annot-May.ipynb"]
for_coverage = ["./EBI_INTER/Project_1/Step_1.1_Coverage/3_Final_plot_Coverage-May.ipynb"]
```

### 2.Venn_chart_(R)

```python
# 1. Prepare the data for plotting
dir_1 = ["./EBI_INTER/Project_1/Step_2_Visualization_Venn/May.9_prepare_data.ipynb"]
# 2. Plot
dir_2 = ["./EBI_INTER/Project_1/Step_2_Visualization_Venn/May.2_Venn.Rmd"]
```
