```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Treat_all_pos_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_Treat_all_pos.py'
```

```bash
sed -i 's|checkid_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/output_modified.txt"|checkid_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_50000_estimatedErrors.txt"|g' Coverage_Treat_all_pos.py
```

### Final version

In [None]:
import csv
import os
import shutil
import glob
import random
import collections

def check_files_with_id(folder_path, checkid_file, output_folder):
    """
    Copy the files in folder_path whose filenames contain any ID listed in checkid_file
    (IDs are taken from lines starting with '>') to output_folder.
    """
    id_set = set()

    with open(checkid_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                id_set.add(line[1:])

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(folder_path):
        if any(id_str in filename for id_str in id_set):
            shutil.copy(os.path.join(folder_path, filename), os.path.join(output_folder, filename))
            
def process_files(input_folder, output_folder):
    """
    Integrate and merge all data (in preparation for later sampling)
    """
    # Get the list of file names in the input folder
    file_names = os.listdir(input_folder)

    # Create a new txt file to store the file content
    with open(os.path.join(output_folder, "Cov_RATIO.txt"), "wt") as output_file:
        # Create a csv writer object and set the delimiter as '\t'
        writer = csv.writer(output_file, delimiter='\t')
        # Write the column names to the output file
        writer.writerow(["ID", "Position", "Ratio"])

        # Loop through every file with the extension '.txt' in the input folder
        for file_name in file_names:
            if file_name.endswith(".txt"):
                # Extract the file ID from the file name
                file_id = file_name.split("_")[0]
                # Open the file, skip the header, and read the Position and Ratio columns
                with open(os.path.join(input_folder, file_name), "r") as f:
                    file_lines = f.readlines()[1:]
                    # Loop through each line and write the ID, Position, and Ratio to the output file
                    for line in file_lines:
                        columns = line.strip().split("\t")
                        position = columns[0]
                        ratio = columns[2]
                        writer.writerow([file_id, position, ratio])




In [None]:
folder_path = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Decompress/'
checkid_file = "/nfs/research/goldman/zihao/errorsProject_1/MAPLE/TEST_50000/MAPLE0.3.2_rateVar_errors_realData_checkingErrors_50000_estimatedErrors.txt"
middle_output_folder = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/PLOT_FOR_Coverage/'

output_folder = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/'

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Run the function
check_files_with_id(folder_path, checkid_file, middle_output_folder)
process_files(middle_output_folder, output_folder)



## Sampling

```bash
bsub -M 2000 -e /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_sampling_errorChecking_error.txt 'python3 /nfs/research/goldman/zihao/errorsProject_1/Coverage/Coverage_sampling_all_pos.py'
```

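The goal is to randomly select five IDs at each position from all available IDs. The dataset is very large and cannot be loaded into memory at once. In this situation we can use the Reservoir Sampling algorithm, which draws a random sample from a large data stream using a bounded amount of memory.

The code below uses Reservoir Sampling. Note that Reservoir Sampling gives every ID at a given position an equal probability of being selected, but it cannot guarantee that every position ends up with five different IDs. In particular, a position with fewer than five IDs simply keeps all of them.

The basic idea of the algorithm is this: for each newly arriving element, we decide with a certain probability whether to include it in the sample. If we include it, we pick one element of the current sample at random and replace it with the new element. As the stream is processed, every element ends up with the same probability of being selected.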

## Reservoir Sampling Algorithm

Reservoir Sampling is a randomized algorithm used to select a random sample of `k` items from a stream or a large dataset of unknown size. The algorithm ensures that each item in the stream has an equal probability of being selected for the sample, regardless of the size of the dataset. Reservoir Sampling is particularly useful when the dataset is too large to fit into memory or when the size is unknown in advance.

The algorithm works as follows:

1. Initialize an empty reservoir of size `k` to store the sampled items.
2. Read the stream or dataset one item at a time.
3. For the first `k` items, simply add them to the reservoir.
4. For the `i`th item (where `i > k`), generate a random number `j` between `1` and `i` (inclusive).
   - If `j <= k`, replace the `j`th item in the reservoir with the `i`th item.
   - Otherwise, ignore the `i`th item and continue to the next item.
5. Repeat step 4 until all items in the stream or dataset have been processed.
6. The final reservoir contains a random sample of `k` items from the stream or dataset.
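
The steps above can be sketched as a small generic function. This is a minimal illustration, not the code used in the pipeline: the name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here for the example.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration; `rng` may be the `random` module
    or a `random.Random` instance (useful for reproducible runs).
    """
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based item index
        if i <= k:
            # Step 3: the first k items fill the reservoir directly
            reservoir.append(item)
        else:
            # Step 4: draw j uniformly from 1..i (inclusive)
            j = rng.randint(1, i)
            if j <= k:
                # Replace the j-th reservoir slot with the new item
                reservoir[j - 1] = item
    return reservoir


# Example: sample 5 values from a stream of 1000 without storing the stream
sample = reservoir_sample(range(1000), 5, random.Random(42))
print(sample)
```

If the stream holds fewer than `k` items, the function returns all of them, which matches the caveat about positions with fewer than five IDs.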

Concretely, once at least `k` items have been processed, the probability of any specific item being in the final reservoir is `k` divided by the total number of items processed. If fewer than `k` items are seen, all of them are kept.
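
This probability can be checked with a short argument. The sketch below writes $n$ for the total number of items processed (a symbol introduced here, with $n \ge k$) and uses the fact that, at step $m > k$, a given reservoir slot is overwritten with probability $\frac{k}{m} \cdot \frac{1}{k} = \frac{1}{m}$.

For an item with index $i > k$, which enters the reservoir with probability $\frac{k}{i}$:

$$
P(\text{item } i \text{ is in the final reservoir})
= \frac{k}{i} \prod_{m=i+1}^{n} \left(1 - \frac{1}{m}\right)
= \frac{k}{i} \prod_{m=i+1}^{n} \frac{m-1}{m}
= \frac{k}{i} \cdot \frac{i}{n}
= \frac{k}{n}.
$$

For an item with index $i \le k$, which enters the reservoir with probability 1:

$$
P(\text{item } i \text{ is in the final reservoir})
= \prod_{m=k+1}^{n} \frac{m-1}{m}
= \frac{k}{n}.
$$

Both cases give $\frac{k}{n}$, so every item is equally likely to be kept.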

In [None]:
def experiment_with_data(data_file, output_file, sample_size):
    """
    Sample with the Reservoir Sampling algorithm (at most `sample_size` records are retained at each position)
    """
    # Reservoir for each position: position -> list of (id, ratio) tuples
    selected_data = collections.defaultdict(list)
    # Number of records seen so far at each position; the replacement
    # probability depends on this count, not on the reservoir length
    seen_counts = collections.defaultdict(int)

    with open(data_file, 'r') as file:
        next(file)  # Skip the header line
        for line in file:

            id_, position, ratio = line.strip().split('\t')
            position = int(position)
            ratio = float(ratio)

            seen_counts[position] += 1
            reservoir = selected_data[position]

            if len(reservoir) < sample_size:
                # The first `sample_size` records at a position are always kept
                reservoir.append((id_, ratio))
            else:
                # Keep the new record with probability sample_size / seen_counts[position]:
                # draw a slot index in [0, seen_counts[position] - 1] and replace
                # only if it falls inside the reservoir
                s = random.randrange(seen_counts[position])
                if s < sample_size:
                    reservoir[s] = (id_, ratio)

    with open(output_file, 'w') as file:
        file.write("ID\tPosition\tRatio\n")
        for position in selected_data:
            for id_, ratio in selected_data[position]:
                file.write(f"{id_}\t{position}\t{ratio}\n")

In [None]:
# Sampling
data_file = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/Cov_RATIO.txt'
output_file = '/nfs/research/goldman/zihao/Datas/p1/File_5_coverage/selected_data.txt'
sample_size = 5  # This is the number of samples you want to retain at each location
experiment_with_data(data_file, output_file, sample_size)