# Generating a Uniform Allele Distribution Dataset

This notebook shows how to generate synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences with GenAIRR. It uses GenAIRR's modular pipeline structure to create a dataset in which allele combinations are uniformly distributed, that is, every combination is simulated the same number of times. Such a dataset can be used to benchmark and analyze bioinformatics tools and methods.

## Initial Setup and Imports

In this section, we import the necessary modules from GenAIRR and related Python libraries. This includes:
- Core GenAIRR classes for constructing simulation pipelines.
- Steps for sequence simulation and various augmentation processes.
- Mutation models and data configurations.


In [1]:
# Import GenAIRR classes
from GenAIRR.pipeline import AugmentationPipeline
from GenAIRR.steps import (
    SimulateSequence,
    FixVPositionAfterTrimmingIndexAmbiguity,
    FixDPositionAfterTrimmingIndexAmbiguity,
    FixJPositionAfterTrimmingIndexAmbiguity,
)
from GenAIRR.steps import (
    CorrectForVEndCut, CorrectForDTrims, CorruptSequenceBeginning,
    InsertNs, InsertIndels, ShortDValidation, DistillMutationRate,
)
from GenAIRR.mutation import S5F
from GenAIRR.steps.StepBase import AugmentationStep
from GenAIRR.parameters import ChainType, CHAIN_TYPE_INFO
from GenAIRR.data import builtin_heavy_chain_data_config
import pandas as pd
import random
from itertools import product

# Initialize a DataConfig with the path to your own configuration
# data_config = DataConfig('/path/to/your/config')
# Or use one of the built-in data configs
data_config_builtin = builtin_heavy_chain_data_config()



### Preparing Allele Lists
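The DataConfig stores its V, D, and J alleles in groups; the comprehensions below iterate over the keys of `v_alleles`, `d_alleles`, and `j_alleles` and flatten each into a single list. These flat lists are the building blocks for the allele combinations in the dataset.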

In [2]:
V_alleles = [i for j in data_config_builtin.v_alleles for i in data_config_builtin.v_alleles[j]]
D_alleles = [i for j in data_config_builtin.d_alleles for i in data_config_builtin.d_alleles[j]]
J_alleles = [i for j in data_config_builtin.j_alleles for i in data_config_builtin.j_alleles[j]]
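
The nested comprehensions above flatten a grouped collection into one list. Here is a minimal sketch of the same pattern on toy data, assuming (as the comprehension implies) that `v_alleles` behaves like a dictionary whose values are lists of alleles. The group and allele names are hypothetical placeholders, and plain strings stand in for GenAIRR's allele objects.

```python
from itertools import chain

# Hypothetical stand-in for data_config_builtin.v_alleles: group key -> list of alleles.
toy_v_alleles = {
    "GENE-A": ["GENE-A*01", "GENE-A*02"],
    "GENE-B": ["GENE-B*01"],
}

# Same pattern as the notebook cell: iterate over keys, then over each group's alleles.
flat_nested = [allele for gene in toy_v_alleles for allele in toy_v_alleles[gene]]

# Equivalent formulation with itertools.chain, which avoids the double lookup.
flat_chain = list(chain.from_iterable(toy_v_alleles.values()))

print(flat_nested)
```

Both forms preserve the order of the groups and of the alleles within each group, so either can feed the combination step.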



### Generating Allele Combinations

We use Python's itertools.product to create all possible combinations of V-D, D-J, and V-D-J alleles.

This approach ensures that each possible combination of alleles is represented in the dataset, allowing for comprehensive benchmarking of tools and analyses that require diverse sequence inputs.

In [3]:
# Each allele is used once; no repetition is applied to the lists here


# Shuffle each list
random.shuffle(V_alleles)
random.shuffle(D_alleles)
random.shuffle(J_alleles)

# Generate every V-D, D-J, and V-D-J combination (Cartesian products)
V_D_combinations = list(product(V_alleles, D_alleles))
D_J_combinations = list(product(D_alleles, J_alleles))
V_D_J_combinations = list(product(V_alleles, D_alleles, J_alleles))

all_combinations = V_D_combinations + D_J_combinations + V_D_J_combinations

total_combinations = len(all_combinations)



In [4]:
print(f"Number of V-D combinations: {len(V_D_combinations):,}")
print(f"Number of D-J combinations: {len(D_J_combinations):,}")
print(f"Number of V-D-J combinations: {len(V_D_J_combinations):,}")
print(f"Total number of combinations: {len(V_D_combinations) + len(D_J_combinations) + len(V_D_J_combinations):,}")

Number of V-D combinations: 6,534
Number of D-J combinations: 231
Number of V-D-J combinations: 45,738
Total number of combinations: 52,503
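
The printed counts follow from the sizes of the three allele lists, because `itertools.product` yields one tuple per element of the Cartesian product. The sketch below reproduces the arithmetic with placeholder lists. The list sizes (198 V, 33 D, 7 J) are not stated in the notebook; they are inferred here from the printed counts and should be treated as an assumption.

```python
from itertools import product

# Assumed list sizes, inferred from the printed counts (not stated in the notebook).
n_v, n_d, n_j = 198, 33, 7

# Placeholder allele names; only the list lengths matter here.
v = [f"V{i}" for i in range(n_v)]
d = [f"D{i}" for i in range(n_d)]
j = [f"J{i}" for i in range(n_j)]

n_vd = len(list(product(v, d)))      # |V| * |D|
n_dj = len(list(product(d, j)))      # |D| * |J|
n_vdj = len(list(product(v, d, j)))  # |V| * |D| * |J|

print(n_vd, n_dj, n_vdj, n_vd + n_dj + n_vdj)
```

The V-D-J product dominates the total, so a full pass over every triple costs roughly seven times as many simulations as the V-D pass.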


### Setting Up the Augmentation Pipeline
In this step, we set up a GenAIRR simulation pipeline, using the built-in BCR heavy-chain pipeline as the base structure. We then iterate over the allele combinations generated earlier and replace the first step of the pipeline (`SimulateSequence`) so that it enforces specific alleles for each sequence. The loop below iterates over the V-D combinations, so only the V and D alleles are fixed and the J allele is left to the simulator; the same loop can be repeated for the other combination lists. Each combination is simulated `n_repeats` times, so every combination is equally represented in the dataset.

In [7]:
from tqdm.auto import tqdm
from GenAIRR.mutation import Uniform  # as an example, we use a Uniform mutation model

# Set the DataConfig and the chain type for the simulations
AugmentationStep.set_dataconfig(
    config=CHAIN_TYPE_INFO[ChainType.BCR_HEAVY].dataconfig,
    chain_type=ChainType.BCR_HEAVY,
)

pipeline_steps = [
    SimulateSequence(mutation_model=Uniform(min_mutation_rate=0.003, max_mutation_rate=0.25)),
    FixVPositionAfterTrimmingIndexAmbiguity(),
    FixDPositionAfterTrimmingIndexAmbiguity(),
    FixJPositionAfterTrimmingIndexAmbiguity(),
    CorrectForVEndCut(),
    CorrectForDTrims(),
    CorruptSequenceBeginning(
        corruption_probability=0.7,
        corrupt_events_proba=[0.4, 0.4, 0.2],
        max_sequence_length=576,
        nucleotide_add_coefficient=210,
        nucleotide_remove_coefficient=310,
        nucleotide_add_after_remove_coefficient=50,
        random_sequence_add_proba=1,
        single_base_stream_proba=0,
        duplicate_leading_proba=0,
        random_allele_proba=0,
    ),
    InsertNs(n_ratio=0.02, proba=0.5),
    ShortDValidation(short_d_length=5),
    InsertIndels(indel_probability=0.5, max_indels=5, insertion_proba=0.5, deletion_proba=0.5),
    DistillMutationRate(),
]


simulated_sequences = []
n_repeats = 5

# The loop below can be replaced or repeated for any other set of combinations
for V, D in tqdm(V_D_combinations):
    # Build a simulation step that enforces this specific V/D pair
    sim_step = SimulateSequence(
        mutation_model=Uniform(min_mutation_rate=0.003, max_mutation_rate=0.25),
        productive=False,
        specific_v=V,
        specific_d=D,
    )
    # Set the simulation step above as the first step in the pipeline
    pipeline_steps[0] = sim_step
    pipeline = AugmentationPipeline(pipeline_steps)
    for _ in range(n_repeats):  # generate n_repeats sequences for the same combination
        simulated_sequences.append(pipeline.execute().get_dict())




In [8]:
print(f"Simulated Sequences : {len(simulated_sequences):,}")

Simulated Sequences : 32,670
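
The total is the number of V-D combinations multiplied by `n_repeats` (6,534 × 5 = 32,670). The sketch below shows the same loop structure without GenAIRR, so it can be run on its own. `fake_simulate` is a hypothetical stand-in for building and executing the pipeline, and the allele names are placeholders.

```python
from itertools import product


def fake_simulate(v, d):
    """Hypothetical stand-in for pipeline.execute().get_dict()."""
    return {"v_call": [v], "d_call": [d]}


# Placeholder allele lists; the real lists come from the DataConfig.
v_alleles = ["V1", "V2", "V3"]
d_alleles = ["D1", "D2"]
n_repeats = 5

records = []
for v, d in product(v_alleles, d_alleles):
    # Every combination is simulated the same number of times.
    for _ in range(n_repeats):
        records.append(fake_simulate(v, d))

# Expected size: number of combinations times the number of repeats.
expected = len(v_alleles) * len(d_alleles) * n_repeats
assert len(records) == expected
```

Checking the record count against this product is a cheap way to confirm that no combination was skipped, for example after an interrupted run.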


In [9]:
pd.DataFrame(simulated_sequences)

| | sequence | v_call | d_call | j_call | c_call | v_sequence_start | v_sequence_end | d_sequence_start | d_sequence_end | j_sequence_start | ... | c_trim_3 | productive | stop_codon | vj_in_frame | note | corruption_event | corruption_add_amount | corruption_remove_amount | corruption_removed_section | corruption_added_section |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CAGATGCAGCTCCAGGAGTCAAGCTCACGGCTGGCCTGGCCTTTAC... | `[IGHVF3-G12*04]` | `[IGHD2-21*01, IGHD2-21*02]` | `[IGHJ5*02]` | `[IGHA1*01]` | 0 | 301 | 303 | 308 | 321 | ... | 35.0 | False | True | False | Stop codon present. | no-corruption | 0 | 0 | | |
| 1 | CAGCTGCAGCTGCAGAGAGTCCGGCTCAGGACTGGTGAAGCCTTCA... | `[IGHVF3-G12*04]` | `[IGHD2-21*01, IGHD2-21*02]` | `[IGHJ6*03]` | `[IGHG4*06]` | 0 | 301 | 303 | 314 | 323 | ... | 37.0 | False | True | False | Stop codon present. | no-corruption | 0 | 0 | | |
| 2 | AGTCATCGGTAAAGGCCCTTGAGGGGAGAGGGTGCATCAATCCTAA... | `[IGHVF3-G12*04]` | `[IGHD2-21*01]` | `[IGHJ1*01]` | `[IGHG3*17]` | 0 | 178 | 202 | 214 | 217 | ... | | False | True | False | Stop codon present.Junction length not divisib... | remove | 0 | 121 | AAGCCGCAGAAGCAGTCCTCCCGCTAAGGACTCACGCAGTCTTTAC... | |
| 3 | CAGCTGCAGCTGCGTGAGTCCGGCTCAGGACTGGTGAAGCCTTCAC... | `[IGHVF3-G12*04]` | `[IGHD2-21*01, IGHD2-15*01, IGHD2-8*02, IGHD2-2...` | `[IGHJ4*02]` | `[IGHG3*23]` | 0 | 299 | 311 | 317 | 327 | ... | 20.0 | False | True | False | | no-corruption | 0 | 0 | | |
| 4 | TGGAGCTGGATCCGGAGGTCAGAAGGGAAGGGCCTGGAGAGGAGAG... | `[IGHVF3-G12*04]` | `[Short-D]` | `[IGHJ6*02]` | `[IGHG3*01]` | 0 | 194 | 202 | 205 | 206 | ... | 1.0 | False | True | False | Stop codon present. | remove | 0 | 105 | GATCTGCAGCTGCAGGAGTTCGGCTCAAGCATTCTGGTGCCTACAC... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32665 | CATGTGCAGCTGCATGAGTCGGGCCCTGAACTGTCTAGGCCTTCGC... | `[IGHVF3-G7*03]` | `[IGHD3-3*01]` | `[IGHJ3*02]` | `[IGHG3*09]` | 0 | 296 | 310 | 320 | 320 | ... | 11.0 | False | False | False | V second C not present. | no-corruption | 0 | 0 | | |
| 32666 | CTACGGATTTTCACCGTCACCCACGTTTCAAAATGCTCCGTCGCAT... | `[IGHVF3-G7*03]` | `[IGHD3-3*01]` | `[IGHJ6*03]` | `[IGHG2*05]` | 143 | 441 | 451 | 477 | 486 | ... | 32.0 | False | True | False | Stop codon present.Junction length not divisib... | add | 143 | 0 | | CTACGGATTTTCACCGTCACCCACGTTTCAAAATGCTCCGTCGCAT... |
| 32667 | CTCCCTCAAGAGTCGAGTCATCATATCAGTTGACCCGTCTAAGAAC... | `[IGHVF3-G7*03, IGHVF3-G6*07, IGHVF3-G7*05, IGH...` | `[IGHD3-3*01]` | `[IGHJ6*03]` | `[IGHM*04]` | 0 | 110 | 124 | 144 | 147 | ... | 22.0 | False | True | False | Stop codon present.Junction length not divisib... | remove | 0 | 188 | CAGGTTCAGCCGTAGGAGTCGGGCACAAGACTCGTGAAGCCTTCTG... | |
| 32668 | CAGGTGCAGCTTCAGGAGNCGGGCCCAGGACCGGTGATGCGCTCTG... | `[IGHVF3-G7*03]` | `[IGHD3-3*01]` | `[IGHJ5*02]` | `[IGHE*03]` | 0 | 296 | 299 | 314 | 315 | ... | 6.0 | False | True | False | | no-corruption | 0 | 0 | | |
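
The reported calls do not always match the requested alleles one-to-one: some rows list several alleles in one call, and some D calls read `[Short-D]`. A quick tally of the calls is therefore a useful sanity check on how uniform the final dataset looks. The sketch below assumes each record is a dictionary with list-valued `v_call` and `d_call` fields holding strings, as the output above suggests. The toy records and the helper names `allele_counts` and `ambiguous_fraction` are hypothetical.

```python
from collections import Counter

# Hypothetical records mimicking the structure of simulated_sequences.
records = [
    {"v_call": ["V1*01"], "d_call": ["D1*01", "D1*02"]},
    {"v_call": ["V1*01"], "d_call": ["Short-D"]},
    {"v_call": ["V2*01", "V2*02"], "d_call": ["D2*01"]},
    {"v_call": ["V2*01"], "d_call": ["D2*01"]},
]


def allele_counts(records, field):
    """Count how many records list each allele in the given call field."""
    counts = Counter()
    for rec in records:
        # set() ensures an allele is counted at most once per record.
        counts.update(set(rec[field]))
    return counts


def ambiguous_fraction(records, field):
    """Fraction of records whose call field lists more than one allele."""
    return sum(len(rec[field]) > 1 for rec in records) / len(records)


v_counts = allele_counts(records, "v_call")
print(v_counts)
print(ambiguous_fraction(records, "v_call"))
```

Under the same assumptions, `allele_counts(simulated_sequences, "v_call")` would give the per-allele tally for the real dataset. Large deviations from `n_repeats` times the number of partner alleles would point to ambiguity or corruption effects rather than to the sampling itself.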
